https://www.kaggle.com/c/home-credit-default-risk
This competition is sponsored by Home Credit, whose mission is to provide a positive and safe borrowing experience to groups of people that traditional, mainstream banks and financial institutions typically refuse to serve.
In order to make lending decisions on applicants from this demographic, Home Credit needs an algorithm that will take as inputs various financial and personal information originally taken from a loan applicant's profile, and then compute a probability that the applicant will have trouble paying back the loan. This probability will be in the range [0.0, 1.0], where 1.0 represents a 100% certainty that the applicant will have repayment difficulties and 0.0 indicates that there is zero chance that the applicant will ever miss any payments. The algorithm will be tested and ranked on Kaggle based on a set of predictions it makes for 48,744 individuals who previously borrowed from Home Credit.
Solution algorithms will be trained on a set of datapoints from 307,511 previous Home Credit borrowers. It is imperative that some portion, say 20%, of the training set is set aside to serve as a validation set. Alternatively, an algorithm such as K-Fold Cross Validation could be used.
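Both approaches mentioned above (a 20% hold-out validation set, or K-Fold cross validation) can be sketched with scikit-learn. This is a minimal illustration on invented toy data, not the project's actual split; the stratified variants are used so that the imbalanced TARGET ratio is preserved in every split:

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Toy stand-in for the 307,511-row training set (all values invented).
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 45 + [1] * 5)  # imbalanced, like the real TARGET column

# Option 1: hold out 20% as a stratified validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Option 2: 5-fold stratified cross validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(val_idx) for _, val_idx in skf.split(X, y)]
```

Stratification matters here because only about 8% of borrowers are delinquent; a purely random split could otherwise leave a fold with too few positive examples.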
To submit a solution on Kaggle, a CSV file must be produced that contains one header row and 48,744 prediction rows, where each prediction row contains both a user ID (the SK_ID_CURR column) and the probability (the TARGET column) of that user having repayment difficulties. The file must be formatted as follows:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
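A file in this format could be produced from a dataframe of predictions with pandas. This is a minimal sketch using the illustrative IDs and probabilities shown above, not real model output:

```python
import pandas as pd

# Hypothetical predictions for three applicants (values from the example above).
submission = pd.DataFrame({
    'SK_ID_CURR': [100001, 100005, 100013],
    'TARGET': [0.1, 0.9, 0.2],
})

# index=False keeps the file to exactly one header row plus one row per applicant.
submission.to_csv('submission.csv', index=False)
```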
Home Credit knows which borrowers in the test set were delinquent, and which ones never made a late payment. A good algorithm will need to predict a high probability of delinquent repayment for the majority of borrowers who did in fact make late payments (those whose TARGET value is 1 in the main table in the dataset). This algorithm will also need to predict a low probability of delinquent repayment for the majority of borrowers who never made a late payment (those whose TARGET value is 0 in the main table in the dataset).
# Import libraries necessary for this project.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Import supplementary visualizations code visuals.py
import visuals as vs
# Display matplotlib plots inline in this notebook.
%matplotlib inline
# Make plots display well on retina displays
%config InlineBackend.figure_format = 'retina'
# Set dpi of plots displayed inline
mpl.rcParams['figure.dpi'] = 300
# Configure style of plots
plt.style.use('fivethirtyeight')
# Make plots smaller
sns.set_context('paper')
# Allows the use of display() for dataframes.
from IPython.display import display
# Have all columns appear when dataframes are displayed.
pd.set_option('display.max_columns', None)
# Have up to 500 rows appear when a dataframe is displayed
pd.set_option('display.max_rows', 500)
# Display dimensions whenever a dataframe is printed out.
pd.set_option('display.show_dimensions', True)
From https://www.kaggle.com/c/home-credit-default-risk/data:
bureau_balance.csv
previous_application.csv
POS_CASH_balance.csv
installments_payments.csv
credit_card_balance.csv

# Load the main data tables
application_train_data = pd.read_csv("data/application_train.csv")
application_test_data = pd.read_csv("data/application_test.csv")
# Load the Bureau data table
bureau_data = pd.read_csv("data/bureau.csv")
# Load all other data tables
bureau_balance_data = pd.read_csv("data/bureau_balance.csv")
previous_application_data = pd.read_csv("data/previous_application.csv")
POS_CASH_balance_data = pd.read_csv("data/POS_CASH_balance.csv")
installments_payments_data = pd.read_csv("data/installments_payments.csv")
credit_card_balance_data = pd.read_csv("data/credit_card_balance.csv")
# Total number of entries in training group
print("Total number of entries in training group: {}".format(application_train_data.shape[0]))
# Total number of entries in test group
print("Total number of entries in test group: {}".format(application_test_data.shape[0]))
# Total number of features in the main (application) data table
print("Total number of features in main (application) data table: {}".format(application_train_data.shape[1]))
The first two features in the main data table training group, SK_ID_CURR and TARGET, represent the borrower's ID number and target data (whether or not they made at least one late payment), respectively.
There are therefore 120 features in the main data table that can be used to predict a borrower's target.
# Display the first 500 records
display(application_train_data.head(n=500))
# Display the above data sample table, but with axes transposed and limited to 5 records, so that the table can
# be included in the project's writeup.
display(application_train_data.head(n=5).transpose())
# Display a statistical description of the numerical features, along with all features that
# have already been one-hot encoded, in the main (application) data table.
display(application_train_data.describe())
# Display the above statistical description table, but with axes inverted, so that the table can
# be included in the project's writeup.
display(application_train_data.describe().transpose())
# Count number of delinquent repayers ('TARGET' value of 1) and non-delinquent repayers
# ('TARGET' value of 0) in the training set of the main data table.
application_train_data['TARGET'].value_counts()
# Fraction of applicants in training set who were delinquent repayers. If you took a random
# sample, this is the probability you would select a delinquent repayer by chance:
delinquent_fraction = round(24825/(24825+282686), 4)
print('Fraction of training set who were delinquent repayers: {}'.format(delinquent_fraction))
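The same fraction can be obtained directly from the Series without hard-coding the counts. A sketch on a toy stand-in for the TARGET column (the 1-in-10 split here is invented for illustration):

```python
import pandas as pd

# Toy stand-in for application_train_data['TARGET'].
target = pd.Series([0] * 9 + [1])

# value_counts(normalize=True) yields class fractions directly,
# so no counts need to be copied in by hand.
delinquent_fraction = target.value_counts(normalize=True)[1]
```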
Do any features have mostly 'NaN' for their entries -- are any features too sparse to be of use?
# Get numerical counts of number of NaN entries in each column (feature) in the main data table.
features_sorted_by_NaN_count = application_train_data.isnull().sum().sort_values(ascending=False)
# Display only features with NaN counts greater than 0.
feature_NaN_counts = features_sorted_by_NaN_count[features_sorted_by_NaN_count > 0]
# Create a dataframe to summarize 'NaN' entries of features in the main data table (application_train_data)
feature_NaN_summary = pd.DataFrame(index=feature_NaN_counts.index, columns=['#_NaN_Entries','Fraction_of_entries_that_are_NaN','#_NaN_entries_who_are_Delinquent','Fraction_of_NaN_entries_who_are_Delinquent','#non_NaN_Entries','#non_NaN_entries_who_are_Delinquent','Fraction_of_non_NaN_entries_who_are_Delinquent'])
# Fill each row in the NaN summary dataframe
for feature_name in feature_NaN_summary.index:
    # Get the amount and fraction of delinquents among all borrowers who
    # have an 'NaN' entry for a particular feature. Do this for each feature
    # that has at least one 'NaN' entry.
    number_of_NaN = feature_NaN_counts.loc[feature_name]
    feature_NaN_summary.loc[feature_name, '#_NaN_Entries'] = number_of_NaN
    number_delinquents_who_are_NaN = application_train_data[(application_train_data[feature_name].isnull()) & (application_train_data['TARGET'] == 1)].shape[0]
    feature_NaN_summary.loc[feature_name, '#_NaN_entries_who_are_Delinquent'] = number_delinquents_who_are_NaN
    fraction_of_NaN_entries_who_are_delinquents = round(number_delinquents_who_are_NaN/number_of_NaN, 4)
    feature_NaN_summary.loc[feature_name, 'Fraction_of_NaN_entries_who_are_Delinquent'] = fraction_of_NaN_entries_who_are_delinquents
    # Get the amount of non-'NaN' entries in each feature that
    # has at least one 'NaN' entry.
    number_of_records = application_train_data[feature_name].shape[0]
    number_of_non_NaN = number_of_records - number_of_NaN
    feature_NaN_summary.loc[feature_name, '#non_NaN_Entries'] = number_of_non_NaN
    # Get the fraction of the total entries for a feature that are 'NaN'
    fraction_of_feature_entries_that_are_NaN = round(number_of_NaN/number_of_records, 4)
    feature_NaN_summary.loc[feature_name, 'Fraction_of_entries_that_are_NaN'] = fraction_of_feature_entries_that_are_NaN
    # Get the amount and fraction of delinquents among all borrowers who
    # have a non-'NaN' entry for a particular feature. Do this for each feature
    # that has at least one 'NaN' entry.
    number_delinquents_who_are_not_NaN = application_train_data[(application_train_data[feature_name].notnull()) & (application_train_data['TARGET'] == 1)].shape[0]
    feature_NaN_summary.loc[feature_name, '#non_NaN_entries_who_are_Delinquent'] = number_delinquents_who_are_not_NaN
    fraction_of_non_NaN_entries_who_are_delinquents = round(number_delinquents_who_are_not_NaN/number_of_non_NaN, 4)
    feature_NaN_summary.loc[feature_name, 'Fraction_of_non_NaN_entries_who_are_Delinquent'] = fraction_of_non_NaN_entries_who_are_delinquents
# Display the NaN summary dataframe below
display(feature_NaN_summary)
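As an aside, the per-feature quantities in the summary above can also be computed without an explicit Python loop. This is a sketch on a toy frame with an invented feature name, showing the pattern for one feature:

```python
import numpy as np
import pandas as pd

# Toy stand-in for application_train_data; 'FEAT_A' is a hypothetical feature.
df = pd.DataFrame({
    'TARGET': [1, 0, 0, 1, 0],
    'FEAT_A': [np.nan, 2.0, np.nan, 4.0, 5.0],
})

is_nan = df['FEAT_A'].isnull()
summary = {
    # Count of 'NaN' entries for the feature.
    'n_nan': int(is_nan.sum()),
    # Mean of a 0/1 TARGET column is exactly the delinquent fraction.
    'nan_delinquent_fraction': df.loc[is_nan, 'TARGET'].mean(),
    'non_nan_delinquent_fraction': df.loc[~is_nan, 'TARGET'].mean(),
}
```

Because TARGET is binary, taking its mean over a boolean-masked slice replaces the separate count-and-divide steps.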
The goal of creating the above summary of all 'NaN' entries in the main data table was to investigate three specific questions concerning the 67 features that each have at least one 'NaN' entry:
1. Does a borrower's merely having an 'NaN' entry for a feature predict whether they will be delinquent?
2. Do the sparsest features still retain enough non-'NaN' entries belonging to delinquent payers to be useful?
3. Which kinds of features (normalized numerical, non-normalized numerical, categorical) contain 'NaN' entries?
A feature such as COMMONAREA_MEDI may have too few non-'NaN' entries that belong to delinquent payers. If too small a portion of the training set's target segment is included in these features, it's hard to see how they will be useful in making predictions that generalize to unseen datapoints.
In the case of question 1, I can confirm that for all 67 features, a borrower simply having an 'NaN' entry for a feature does not predict whether the borrower will be more or less likely to be delinquent. I verified this for each feature by looking at the fractions of both its 'NaN' and non-'NaN' cohorts that were delinquents. For each feature, I found that the fraction of 'NaN' borrowers who were delinquent was nearly identical to the fraction of non-'NaN' borrowers who were delinquent. If being 'NaN' for a particular feature were to have any chance of being a meaningful predictor of delinquency, I would have expected these two proportions to show a statistically significant difference for that feature.
For question 2., it's helpful to remember that, as confirmed above, there are 24,825 borrowers in the training dataset who were delinquent repayers (who had a TARGET value of 1). Any feature that I retain in spite of its 'NaN' entries needs to still have a large enough amount of non-'NaN' entries that belong to delinquents. If the entire training set's delinquent population is not adequately represented among a particular feature's valid data points, it's unlikely that any predictions meaningfully informed by this feature would generalize well to unseen datapoints. The effective training set size for this feature would be just too small, and my model would be at risk of underfitting.
However, the big question is: what should the cutoff line be? What fraction of the 24,825 delinquent borrowers in the training set need to be captured by a feature's valid data points in order for the feature to have a chance of being useful in making predictions? There is no general rule of thumb that I can use to answer this for each of these features that has 'NaN' values. The short answer is, it depends -- on many factors such as distribution of the feature's valid data, as well as the type of classifier algorithm I'm using.
This is why for the time being I won't remove any of these features -- even features like COMMONAREA_MEDI that contain 'NaN' in over two-thirds of their entries. Instead of running detailed statistical analyses on these features' distributions, I will instead experiment with dimensionality reduction algorithms such as PCA and/or feature selection algorithms such as SelectKBest.
If a feature had had, say, 95% of its entries as 'NaN', I probably would have removed it from the dataset at this point. However, since the sparsest features in the main data table still have valid data in just over 30% of their entries, I don't want to prematurely remove a feature that may have a chance, however remote, of possibly contributing to useful predictions.
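A hypothetical 95% cutoff like the one described above could be applied as follows. This is a sketch on a toy frame with invented column names, not a step actually taken in this project:

```python
import numpy as np
import pandas as pd

# Toy frame; 'mostly_nan' hits the 95% threshold, 'mostly_ok' does not.
df = pd.DataFrame({
    'mostly_nan': [np.nan] * 19 + [1.0],   # 95% NaN
    'mostly_ok': [1.0] * 19 + [np.nan],    # 5% NaN
})

# isnull().mean() gives the per-column fraction of 'NaN' entries.
nan_fraction = df.isnull().mean()
keep = nan_fraction[nan_fraction < 0.95].index
df_filtered = df[keep]
```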
Finally, to answer question 3., I found that all but one of the 46 normalized numerical features contain 'NaN' values. About half of the 21 non-normalized numerical features contain 'NaN' entries. Only 3 out of the 15 categorical features that will need to be one-hot encoded contain 'NaN' entries. None of the already one-hot encoded categorical features contain 'NaN' entries.
Do any borrowers have mostly 'NaN' entries for their feature data -- could any datapoints representing borrowers be classified as outliers and removed because their feature data is too sparse?
# Get numerical counts of number of NaN entries in each row (borrower) in the main data table.
borrowers_sorted_by_NaN_count = application_train_data.isnull().sum(axis=1).sort_values(ascending=False)
# Display the number of 'NaN' entries for each borrower (left column
# is borrower ID, right column is number of 'NaN' entries).
display(borrowers_sorted_by_NaN_count)
# Plot a histogram showing the number of borrowers that have a
# certain amount of their feature data as 'NaN' entries.
plt.figure(figsize = (10,6))
plt.hist(borrowers_sorted_by_NaN_count)
plt.title('Borrowers Missing Main Data Table Features')
plt.xlabel('Number of Features Specified as \'NaN\'')
plt.ylabel('Number of Borrowers')
plt.savefig('borrowersnandata.png')
plt.show()
The above plot confirms that no borrowers are missing so many features that they'd be considered outliers and would need to be removed from the training set.
At most, a borrower may be missing roughly only half of the main data table's 120 features. Just under 20,000 of the training dataset's 307,511 borrower records face this "worst-case" scenario. As it is, having 61 missing features falls well below the threshold at which I would decide to remove a borrower from the training dataset for having feature data that is too sparse. A borrower would have to be missing over 100 features, at least 5/6 of the featureset, for me to take the time to explore more deeply whether they may be an outlier.
I explored samples from all features, as well as statistical descriptions of each numerical feature in the main data table (count, mean, standard deviation, minimum value, 25th percentile, 50th percentile, 75th percentile, and maximum value) to ensure that no features contained unexpected values that fall outside the range one would expect based on the feature's definition.
Examples of unexpected values include entries that are impossibly small/large, or values that are negative when only positive values would be expected.
I came across the following five anomalies:
The following five numerical features indicate the number of days prior to the loan application's submission that a particular event took place:
DAYS_BIRTH
DAYS_EMPLOYED
DAYS_REGISTRATION
DAYS_ID_PUBLISH
DAYS_LAST_PHONE_CHANGE
For example, DAYS_LAST_PHONE_CHANGE is defined by Home Credit as: "How many days before application did client change phone?"
Values for the above five features are negative, which is expected, since each value represents a point in time prior to the loan application's submission, which Home Credit defines as time 0; accordingly, 0 is the maximum value for several of these features.
What's unexpected is that the DAYS_EMPLOYED feature (the number of days the applicant has had a job prior to the day they submitted their loan application) has a maximum value that is both a positive number as well as unbelievably large. The maximum value for DAYS_EMPLOYED is 365,243 days, or just over 1,000 years.
No human being lives for 1,000 years, let alone sustains a job for that long, so this entry clearly indicates that some sort of mistake was made. What's not yet clear to me is whether DAYS_EMPLOYED contains only one, a handful, or possibly several such points. Clearly this particular data point and any similar to it are outliers that should be removed, especially if the DAYS_EMPLOYED feature turns out to be otherwise useful for predicting target values. Indeed, based on my intuition, DAYS_EMPLOYED is one of the first features that I would guess would be relevant in predicting whether a borrower would eventually make a late loan payment.
Creating and inspecting a histogram of this feature's data should help me to gauge whether or not there are any other outlier data points that would need to be removed.
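Before plotting, a direct count of the anomalous value would show whether it is a one-off typo or a recurring sentinel. A minimal sketch on invented toy values standing in for the DAYS_EMPLOYED column:

```python
import pandas as pd

# Toy stand-in for application_train_data['DAYS_EMPLOYED']; 365243 is the
# impossible "1,000 years" value described above.
days_employed = pd.Series([-1000, -50, 365243, -2500, 365243])

# Count how many borrowers carry the anomalous value.
n_anomalous = int((days_employed == 365243).sum())
```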
I was undecided whether the features OWN_CAR_AGE, the age of the applicant's car, HOUR_APPR_PROCESS_START, the hour the loan application was submitted, CNT_CHILDREN, the number of children that the applicant has, and CNT_FAM_MEMBERS, the size of the applicant's family, should be thought of as categorical or numerical.
The first reason for this uncertainty was that the entries in each feature were rounded to whole numbers. The second reason was that the range of whole-number entries for each feature was quite limited -- [0,23] for HOUR_APPR_PROCESS_START, [0.0,91.0] for OWN_CAR_AGE, [0.0,19.0] for CNT_CHILDREN, and [1.0,20.0] for CNT_FAM_MEMBERS. Although the range of values in OWN_CAR_AGE is nearly four times that of HOUR_APPR_PROCESS_START, upon exploring the individual entries, I found that the majority of values appeared to be in the range [0.0,20.0]. This makes sense, considering that most cars don't last longer than twenty years. The effective range of HOUR_APPR_PROCESS_START values appeared to be even narrower, with most entries concentrated inside the range [9,17]. This also makes sense, as regular business hours typically run from 9AM to 5PM.
I ultimately decided that even though the effective ranges of both OWN_CAR_AGE and HOUR_APPR_PROCESS_START are far narrower than those of other numerical features in the main data table, the nature of each feature's data requires that I treat both as numerical features.
The other categorical features in the main data table, such as whether the applicant owns a car, or the applicant's housing type, each have distinct entries that can encapsulate wildly different meanings and implications. The condition of owning a car is very different from the condition of not owning a car, and the lifestyle, financial and otherwise, of someone living in an apartment is likely very different from that of someone living in a stand-alone house.
After thinking along these lines, it was easy for me to see that neither the entries in OWN_CAR_AGE nor those in HOUR_APPR_PROCESS_START are necessarily always that different from one another in meaning or implication. For example, is submitting a loan application at 3PM really that different from submitting at 4PM? What about having a car that's eight years old versus having a car that's nine years old? For data like this, it is far more likely that there are meaningful sub-ranges, such as the afternoon hours of 1PM to 5PM, that may be helpful in predicting a borrower's target value. The only way I will be able to discover these sub-ranges is if I treat these features as numerical, not categorical.
Things are different for the CNT_CHILDREN and CNT_FAM_MEMBERS features. Although the most children any loan applicant had was 19, at least 75% of all applicants had either zero children or just one child. I decided that it makes the most sense to treat CNT_CHILDREN as a categorical feature, and re-engineer it to segment the borrower population into the following two categories: having no children, and having one or more children. Although I expect there will be groups of borrowers of diminishing size that have 2, 3, 4, ..., 19 children, my hypothesis is that having no children versus having at least one child will be the information most useful for predicting target values for the overall population of borrowers. Even if I were to spend time investigating the effects of having 2 vs. 3 vs. 4 vs. ... vs. 19 children, I know that my findings would apply to less than 25% of the overall population, and any predictions informed by these effects likely wouldn't generalize well to unseen datapoints.
I will transform the CNT_CHILDREN feature into a binary categorical feature called HAS_CHILDREN. If the value of CNT_CHILDREN is greater than 0, the value of HAS_CHILDREN will be 1. If the value of CNT_CHILDREN is 0, the value of HAS_CHILDREN will be 0.
For CNT_FAM_MEMBERS, the situation is somewhat similar. 25% of borrowers in the training set have a family size of just one, 50% have a family of two or less, and 75% of borrowers have families of 3 people or less. I plan to re-engineer this feature to categorically segment borrowers into the following three groups: having a family size of one, a family size of two, and having a family that's three people or larger.
I will transform the CNT_FAM_MEMBERS feature into a categorical feature called NUMBER_FAMILY_MEMBERS. If CNT_FAM_MEMBERS is 1.0, then the value of NUMBER_FAMILY_MEMBERS will be 'one'. If CNT_FAM_MEMBERS is 2.0, then NUMBER_FAMILY_MEMBERS will be 'two'. If CNT_FAM_MEMBERS is 3.0 or greater, then NUMBER_FAMILY_MEMBERS will be 'three_plus'. The new categorical feature NUMBER_FAMILY_MEMBERS will eventually be one-hot encoded.
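The two transformations just described could be sketched as follows. The toy values are invented; only the column names and category rules come from the plan above:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the relevant columns of application_train_data.
df = pd.DataFrame({'CNT_CHILDREN': [0, 2, 1],
                   'CNT_FAM_MEMBERS': [1.0, 4.0, 2.0]})

# HAS_CHILDREN: 1 if the borrower has at least one child, else 0.
df['HAS_CHILDREN'] = (df['CNT_CHILDREN'] > 0).astype(int)

# NUMBER_FAMILY_MEMBERS: 'one', 'two', or 'three_plus' (to be one-hot
# encoded later).
df['NUMBER_FAMILY_MEMBERS'] = np.select(
    [df['CNT_FAM_MEMBERS'] == 1.0, df['CNT_FAM_MEMBERS'] == 2.0],
    ['one', 'two'], default='three_plus')
```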
Most features reported as being normalized have a maximum value of 1.0 and a minimum value of 0.0. The following three features, however, have all their values strictly inside the range (0.0, 1.0): none of them reach a maximum of 1.0 or a minimum of 0.0. Even though Home Credit states that each of these three features has been normalized, because the max and min values of their supposedly normalized ranges differ from what I've observed for all other features reported as normalized, I will need to pay special attention to the graphs of these three features' distributions when I conduct my exploratory data visualization, in order to verify that their data is indeed distributed normally:
EXT_SOURCE_1
EXT_SOURCE_2
EXT_SOURCE_3
The feature REGION_POPULATION_RELATIVE appears to be a unique case: while it has also supposedly been normalized, its values merely fall into the range [0.000290, 0.072508]. All other features that Home Credit claims have been normalized more or less fall into the range [0.0,1.0]. It's therefore a given that I'll need to min-max scale this feature during data preprocessing.
The following four features were defined by Home Credit as being "normalized":
FONDKAPREMONT_MODE
HOUSETYPE_MODE
WALLSMATERIAL_MODE
EMERGENCYSTATE_MODE
Specifically, the definition contained in the HomeCredit_columns_description.csv file provided by Home Credit stated that these features were: "Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor"
However, upon further investigation I found that each of these four features was in fact categorical, and would need to be one-hot encoded. WALLSMATERIAL_MODE, for example, contains entries such as "Stone, brick", "panel", and "block" that indicate the material(s) of which the walls in the borrower's house are built.
I divided the main data table's features into groups based both on whether the feature is categorical or numerical, as well as how the feature would need to be preprocessed. ie. has the feature already been normalized, has it already been one-hot encoded, etc. Features that contain at least one 'NaN' value are highlighted in red.
Categorical features needing one-hot encoding:
Categorical features originally mis-identified by Home Credit as normalized (they also need to be one-hot encoded):
Categorical features needing re-engineering:
HAS_CHILDREN, with a value of 0 if the borrower has no children vs. a value of 1 if the borrower has at least one child.
NUMBER_FAMILY_MEMBERS, with values of 'one', 'two', or 'three_plus' depending on whether the borrower's value for CNT_FAM_MEMBERS was 1, 2, or 3 or more. NUMBER_FAMILY_MEMBERS will eventually be one-hot encoded.
Binary categorical features already one-hot encoded:
Numerical features not identified as normalized, that have roughly normal distributions:
Numerical features not identified as normalized, that have skewed distributions:
Numerical features not identified as normalized, that have skewed distributions and negative values:
HAS_JOB, with a value of 1 if the borrower has a value of 0 or less for DAYS_EMPLOYED. HAS_JOB will have a value of 0 if the borrower is one of the 55,374 folks who have a value of 365243 for DAYS_EMPLOYED.
Numerical features identified as normalized, which are scaled to range [0,1]:
Numerical features identified as normalized, which are not scaled to range [0,1]:
Summing the above lists gives 14+4+2+32+3+15+3+46+1 = 120 features, which are all of the features in the main data table.
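The HAS_JOB re-engineering listed above could be sketched as follows, on invented toy values (365243 is the sentinel value discussed earlier):

```python
import pandas as pd

# Toy stand-in for the DAYS_EMPLOYED column.
df = pd.DataFrame({'DAYS_EMPLOYED': [-1200, 365243, -10]})

# HAS_JOB: 1 if DAYS_EMPLOYED is 0 or less (a real employment record),
# 0 for borrowers carrying the 365243 sentinel.
df['HAS_JOB'] = (df['DAYS_EMPLOYED'] <= 0).astype(int)
```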
Main Data Table Featureset Definitions
# Display the first five records
display(bureau_data.head(n=5))
# Display the above data sample table, but with axes transposed, so that the table can
# be included in the project's writeup.
display(bureau_data.head(n=5).transpose())
# Display a statistical description of the numerical features in the bureau data table.
display(bureau_data.describe())
# Display the above statistical description table, but with axes inverted, so that the table can
# be included in the project's writeup.
display(bureau_data.describe().transpose())
Bureau Data Table Featureset Definitions
# Display the first five records
display(bureau_balance_data.head(n=5))
Bureau Balance Data Table Featureset Definitions
# Display the first five records
display(previous_application_data.head(n=5))
Previous Application Data Table Featureset Definitions
# Display the first five records
display(POS_CASH_balance_data.head(n=5))
POS CASH Balance Data Table Featureset Definitions
# Display the first five records
display(installments_payments_data.head(n=5))
Installments Payments Data Table Featureset Definitions
# Display the first five records
display(credit_card_balance_data.head(n=5))
Credit Card Balance Data Table Featureset Definitions
What is one feature that could be engineered from the data contained in the six tables that are supplementary to the main data table?
Of the six other tables in the dataset outside of the main data table, four tables (previous_application.csv, POS_CASH_balance.csv, installments_payments.csv, and credit_card_balance.csv) contain information pertaining to previous loan applications, or payback histories on prior loans, that an applicant has had with Home Credit. To engineer a new feature, I instead intend to focus on the two data tables that describe applicants' payback performance with lenders other than Home Credit:
bureau.csvbureau_balance.csvMy hypothesis is that out of the six supplementary data tables, the above two tables will be the greatest source of supplementary insight.
bureau.csv contains summary information of applicants' loans from other lenders, such as the amount and type of loan, and the total amount, if any, of the repayment balance that's overdue. bureau_balance.csv contains the month by month statuses (whether a particular month's balance payment was received and processed, or the extent to which payment is overdue) for each loan described in bureau.csv.
Although the month-by-month payment statuses in bureau_balance.csv may prove useful with the right kind of time series analysis, for the purposes of this project I will attempt to engineer a feature that is based on the data contained in the features in bureau.csv. In particular, I will focus on the features in bureau.csv that indicate whether an individual has had difficulty repaying previous loans, the extent of that difficulty, and how recently that difficulty has occurred. Some potentially useful features include:
DAYS_CREDIT: Number of days since individual had applied for a loan.DAYS_CREDIT_UPDATE: Number of days since individual's information on the credit bureau was updated.CREDIT_DAY_OVERDUE: Number of days that individual's loan payments have been overdue.AMT_CREDIT_MAX_OVERDUE: Maximum amount individual has ever been overdue on their loan payments.AMT_CREDIT_SUM_OVERDUE: Amount individual is currently overdue on their loan payments.Since I'm interested in knowing whether a Home Credit loan applicant has had recent difficulty paying back loans they've received from other creditors, I will build a feature based solely on the CREDIT_DAY_OVERDUE feature. For simplicity's sake, my engineered feature will merely indicate whether or not a Home Credit applicant currently has overdue loan payments from other creditors.
My new feature will be titled HAS_CREDIT_BUREAU_LOANS_OVERDUE. If a Home Credit applicant has at least one loan in bureau.csv for which CREDIT_DAY_OVERDUE has a value greater than 0, the value of my new feature in the row belonging to that applicant's Home Credit borrower ID will be 1. Otherwise, the value will be 0. I will engineer this feature during the data preprocessing phase, and once it has been created I will append it to the main data table.
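The engineering of HAS_CREDIT_BUREAU_LOANS_OVERDUE described above could be sketched with a groupby over bureau.csv's rows. The toy frames and ID values here are invented; only the column names and the CREDIT_DAY_OVERDUE > 0 rule come from the plan above:

```python
import pandas as pd

# Toy stand-ins for application_train_data and bureau_data.
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'CREDIT_DAY_OVERDUE': [0, 30, 0],
})

# True for an applicant if ANY of their bureau loans is currently overdue.
overdue = (bureau.assign(overdue=bureau['CREDIT_DAY_OVERDUE'] > 0)
                 .groupby('SK_ID_CURR')['overdue'].any())

# Applicants with no bureau records at all default to 0 (not overdue).
app['HAS_CREDIT_BUREAU_LOANS_OVERDUE'] = (
    app['SK_ID_CURR'].map(overdue).fillna(False).astype(int))
```

Mapping by SK_ID_CURR (rather than merging) keeps the main table at one row per applicant, which is what appending the feature requires.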
Home Credit had identified the following 47 numerical features as being normalized. I plotted histograms of each of these features in order to confirm that this is indeed the case, as well as to identify any outliers that might need to be removed.
'REGION_POPULATION_RELATIVE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE'
The plotted histograms ignored all 'NaN' entries in each of the above features. Plots for all 47 features can be viewed in the Appendix.
# List of normalized features
normalized_numerical_features = ['REGION_POPULATION_RELATIVE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE']
# Plot histogram of each normalized feature, omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
vs.plot_feature_distributions(application_train_data[normalized_numerical_features], title='Distributions of Main Data Table\'s Normalized Features', figsize=(14,60), num_cols=3)
I paid special attention to the distributions of the four features I had found anomalous while initially exploring the dataset, REGION_POPULATION_RELATIVE, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3.
It turned out that EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 were the three features I should have been least concerned about -- the shapes of these features' distributions more closely resembled that of the normal bell curve than the shapes of all the other normalized numerical features. And upon closer review, perhaps this shouldn't be so surprising. According to Home Credit's definitions, these three features represent normalized scores that come from an "external data source," and I can only surmise that whatever methodology the external source used to devise and assign these scores may be what causes their values to be more normally distributed across the dataset.
# Plot histogram of ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3'], omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
vs.plot_feature_distributions(application_train_data[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], title='Distributions of Features EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3', figsize=(14,4), num_cols=3)
As I had discovered while exploring the dataset, the feature REGION_POPULATION_RELATIVE does indeed have all its values within an approximate range of [0.00,0.07]. Nothing in the feature's definition suggests why, out of all the normalized features, this would be the only one not scaled to the range [0.0,1.0]. Thankfully, this feature also exhibits minimal positive/negative skewness. The only adjustment necessary will be to min-max scale this feature to the range [0.0,1.0] when preprocessing the data.
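Min-max scaling is a simple linear rescaling. A minimal sketch with NumPy, using made-up values inside the feature's observed ~[0.00,0.07] range:

```python
import numpy as np

# Hypothetical REGION_POPULATION_RELATIVE values (illustrative, not from the dataset)
values = np.array([0.003, 0.010, 0.035, 0.046, 0.069])

# Min-max scaling: (x - min) / (max - min) maps the observed range onto [0.0, 1.0]
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled.min(), scaled.max())  # 0.0 1.0
```

This is the same transform MinMaxScaler applies during preprocessing below, just written out by hand.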
# Plot histogram of ['REGION_POPULATION_RELATIVE'], omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
region_population_relative_data = application_train_data['REGION_POPULATION_RELATIVE']
filtered_region_population_relative_data = region_population_relative_data[~np.isnan(region_population_relative_data)]
plt.figure(figsize = (10,6))
plt.hist(filtered_region_population_relative_data, bins=50)
plt.title('Distribution of Feature REGION_POPULATION_RELATIVE')
plt.xlabel('Value')
plt.ylabel('Number of Borrowers')
plt.savefig('distribREGIONPOPULATIONRELATIVE.png')
plt.show()
Other normalized features, particularly those concerning general characteristics of a borrower's residence, such as YEARS_BUILD_AVG, FLOORSMIN_MODE, and FLOORSMAX_AVG, also have the appearance of a normal distribution without much positive or negative skewness. This makes intuitive sense, as general attributes like the number of floors and the year the residence was built pertain to all borrowers' residences, regardless of dwelling type.
# Plot histogram of ['YEARS_BUILD_AVG', 'FLOORSMIN_MODE, 'FLOORSMAX_AVG'], omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
vs.plot_feature_distributions(application_train_data[['YEARS_BUILD_AVG', 'FLOORSMIN_MODE', 'FLOORSMAX_AVG']], title='Distributions of Features YEARS_BUILD_AVG, FLOORSMIN_MODE, FLOORSMAX_AVG', figsize=(14,5), num_cols=3)
On the other hand, the rest of the normalized features, such as ELEVATORS_MEDI, COMMONAREA_MEDI, and NONLIVINGAPARTMENTS_MEDI, describe more niche characteristics of a residence building that may not apply to many of the borrowers. For example, individuals who live in a house wouldn't be expected to have any elevators, a public common area, or non-living apartments in their residence. As such, these features tend to be noticeably positively skewed.
# Plot histogram of ['ELEVATORS_MEDI','COMMONAREA_MEDI','NONLIVINGAPARTMENTS_MEDI'], omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
vs.plot_feature_distributions(application_train_data[['ELEVATORS_MEDI', 'COMMONAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI']], title='Distributions of Features ELEVATORS_MEDI, COMMONAREA_MEDI, NONLIVINGAPARTMENTS_MEDI', figsize=(14,5), num_cols=3)
The following 21 numerical features were not identified by Home Credit as being normalized. I plotted histograms of each of these features in order to observe their skewness and to discover which ones would be candidates for log-normalization:
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'HOUR_APPR_PROCESS_START', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'
# List of non-normalized features
non_normalized_numerical_features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'HOUR_APPR_PROCESS_START', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
# Plot histogram of each non-normalized feature, omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
vs.plot_feature_distributions(application_train_data[non_normalized_numerical_features], title='Distributions of Main Data Table\'s Non-Normalized Features', figsize=(16,30), num_cols=3)
# Plot histogram of ['DAYS_BIRTH','DAYS_ID_PUBLISH','HOUR_APPR_PROCESS_START'], omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
vs.plot_feature_distributions(application_train_data[['DAYS_BIRTH','DAYS_ID_PUBLISH','HOUR_APPR_PROCESS_START']], title='Distributions of Features DAYS_BIRTH, DAYS_ID_PUBLISH, HOUR_APPR_PROCESS_START', figsize=(14,5), num_cols=3)
While all of these 21 features will need to be min-max scaled to a range of [0.0,1.0], the following three features already exhibit non-skewed, normal shaped distributions. It will not be necessary for them to be log-normalized:
- DAYS_BIRTH
- DAYS_ID_PUBLISH
- HOUR_APPR_PROCESS_START

The rest of the features, especially those with the majority of their values concentrated close to zero yet also having a smattering of large-valued data points, are good candidates for log-normalization. Doing this may prevent these features' very large and very small values from negatively affecting the performance of a learning algorithm.
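The effect of log-normalization can be sketched on a toy, heavily right-skewed sample (illustrative values only, not taken from the dataset):

```python
import numpy as np

# A toy right-skewed sample: most values small, one very large outlier
sample = np.array([1.0, 2.0, 3.0, 5.0, 10.0, 100000.0])

# log(x + 1) leaves zero at zero and compresses large values
logged = np.log1p(sample)

# The spread between the largest and smallest values shrinks dramatically
print(sample.max() / sample.min())   # 100000.0
print(logged.max() / logged.min())   # roughly 16.6
```

After the transform, the outlier no longer dominates the feature's scale, which is exactly the property that helps a downstream learning algorithm.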
There are three non-normalized features, DAYS_EMPLOYED, DAYS_REGISTRATION, and DAYS_LAST_PHONE_CHANGE, that have both skewed distributions and values that fall within a range of negative numbers. Because a log-transformation cannot be applied to negative values, these features' distributions would first need to be translated to the right, such that all their values are greater than or equal to zero.
# Plot histogram of ['DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_LAST_PHONE_CHANGE'], omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
vs.plot_feature_distributions(application_train_data[['DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_LAST_PHONE_CHANGE']], title='Distributions of Features DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_LAST_PHONE_CHANGE', figsize=(14,5), num_cols=3)
Unfortunately, DAYS_EMPLOYED and OWN_CAR_AGE contain an alarming amount of high-valued outliers that likely need to be addressed. I will further explore each of these features below.
DAYS_EMPLOYED Feature
I paid extra close attention to the histogram of DAYS_EMPLOYED, in order to observe whether it has other impossibly large positive entries similar to the value of 365,243 days, or just over 1,000 years, that I observed above.
# Draw a larger plot of the DAYS_EMPLOYED feature's histogram
plt.figure(figsize = (10,6))
plt.hist(application_train_data['DAYS_EMPLOYED'], bins=50)
plt.title('DAYS_EMPLOYED Distribution')
plt.xlabel('Value')
plt.ylabel('Number of Borrowers')
plt.savefig('distribDAYSEMPLOYED.png')
plt.show()
application_train_data['DAYS_EMPLOYED'].value_counts().sort_index(ascending=False)
Based on the definition of DAYS_EMPLOYED, valid values should fall in the range (-inf, 0]. Unfortunately, 55,374 entries, or nearly one-sixth of the entire training dataset, have a value of 365243 for this feature (thankfully, DAYS_EMPLOYED has no 'NaN' entries). Were this value interpreted literally, it would indicate that one-sixth of borrowers got a job just over 1,000 years after submitting their loan applications to Home Credit.
This reading is obviously absurd, so there must be another reason that so many borrowers have 365243 entered for this feature. Since no borrower has any other positive value for this feature, my best guess is that 365243 was not meant to indicate a numerical value at all. I believe this value was entered for applicants who did not have a job when they submitted their loan application to Home Credit. Since any negative integer or 0 would be a valid entry for this feature, and perhaps because a data entry system could not accept anything besides an integer, I hypothesize that whoever entered the data simply used the largest positive integer the system would accept in order to indicate that the applicant didn't have a job.
Intuitively, I can see that DAYS_EMPLOYED may well be a good predictor of target segments -- after all, if someone doesn't have a job, it stands to reason that there is a greater chance they won't have enough money to make loan payments on time. Unfortunately, I am not confident that the feature, as currently structured, would be able to adequately convey this information to a learning algorithm.
I propose to replace the DAYS_EMPLOYED feature with a new categorical feature called HAS_JOB. All individuals who have a value of 365243 for DAYS_EMPLOYED will be assigned a value of 0 for HAS_JOB. All individuals who have a value of 0 or less for DAYS_EMPLOYED will be assigned a value of 1 for HAS_JOB.
OWN_CAR_AGE Feature
# Plot histogram of ['OWN_CAR_AGE'], omitting any rows (borrowers) that have a value
# of 'NaN' for the particular feature
own_car_age_data = application_train_data['OWN_CAR_AGE']
filtered_own_car_age_data = own_car_age_data[~np.isnan(own_car_age_data)]
plt.figure(figsize = (10,6))
plt.hist(filtered_own_car_age_data, bins=50)
plt.title('Distribution of Feature OWN_CAR_AGE')
plt.xlabel('Value')
plt.ylabel('Number of Borrowers')
plt.savefig('distribOWNCARAGE.png')
plt.show()
There appear to be just under 4,000 borrowers whose value for OWN_CAR_AGE falls between 60 and 70. That is far too many people for this to be anything but an anomaly. The only kinds of cars still functional after 60+ years are collectible classic cars, and folks who apply for loans from Home Credit come from a far less well-off demographic than the one associated with classic car collecting.
application_train_data['OWN_CAR_AGE'].value_counts().sort_index(ascending=False)
Nonetheless, based on the unique value counts above, I can see right away that the distribution of OWN_CAR_AGE is less problematic than was the case for DAYS_EMPLOYED. There is at least a smooth decrease in numbers of users having older cars, right up until the anomalous spike.
However, the distribution's pattern would indicate that I should expect only one or two borrowers each to have cars that are 64 and 65 years old. I can't hazard a reasonable guess that could explain why a total of 3,334 individuals have cars aged 64 or 65 years, and I can't formulate a compelling justification for removing these entries from the OWN_CAR_AGE feature. The good news is that this is probably ok. Since 3,334 borrowers is only just over 3% of the OWN_CAR_AGE feature's 104,582 valid non-'NaN' entries, log-normalizing this feature should be enough to counteract any negative effects that the anomalous spike may have on a learning algorithm.
In this section I lay out a roadmap for devising a learning algorithm that predicts which borrowers will make at least one late loan payment (which borrowers have a TARGET value of 1). My approach takes into account everything I learned while exploring and visualizing the main data table.
Data Preprocessing:
- Use the CNT_CHILDREN feature to engineer a binary categorical feature called HAS_CHILDREN.
- Drop CNT_CHILDREN from the main dataframe.
- Use the CNT_FAM_MEMBERS feature to engineer a categorical feature called FAMILY_SIZE.
- Drop CNT_FAM_MEMBERS from the main dataframe.
- Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature.
- Use the DAYS_EMPLOYED feature to engineer a binary categorical feature called HAS_JOB.
- Drop the DAYS_EMPLOYED feature from the main dataframe.
- Translate to positive ranges, then log-transform, the skewed negative-valued features DAYS_REGISTRATION and DAYS_LAST_PHONE_CHANGE.
- Remove the borrower ID column, SK_ID_CURR, from the main dataframe.
- Min-max scale the features DAYS_BIRTH, DAYS_ID_PUBLISH, HOUR_APPR_PROCESS_START, and the normalized feature REGION_POPULATION_RELATIVE. Each feature will be scaled to a range [0.0, 1.0].

Implementation:
Refinement:
- Build a function that preprocesses application_test.csv, returns the area under the ROC curve as a score, and outputs a CSV file containing the posterior probabilities of the classifier's predictions for a testing data set.
- Train the chosen classifier on all of application_train.csv.
- Generate submission predictions for application_test.csv.
# Some imports are redundant with imports made in the early code blocks
# of this notebook. Repeated here for convenience, so that code blocks
# from much higher up don't have to be re-executed when re-initiating
# this notebook.
# Import necessary libraries.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Import supplementary visualizations code visuals.py
import visuals as vs
# Display matplotlib plots inline in this notebook.
%matplotlib inline
# Make plots display well on retina displays
%config InlineBackend.figure_format = 'retina'
# Set dpi of plots displayed inline
mpl.rcParams['figure.dpi'] = 300
# Configure style of plots
plt.style.use('fivethirtyeight')
# Make plots smaller
sns.set_context('paper')
# Allows the use of display() for dataframes.
from IPython.display import display
# Have all columns appear when dataframes are displayed.
pd.set_option('display.max_columns', None)
# Have up to 500 rows appear when a dataframe is displayed
pd.set_option('display.max_rows', 500)
# Display dimensions whenever a dataframe is printed out.
pd.set_option('display.show_dimensions', True)
# Import data preprocessing libraries
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import MinMaxScaler
# Import feature selection/dimensionality reduction libraries
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, chi2
# Import learning algorithms
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
# Import ROC area-under-curve score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import make_scorer
# Import train-test split, ShuffleSplit, GridSearchCV, and K-fold cross validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.model_selection import StratifiedKFold
# Import a Logistic Regression classifier
from sklearn.linear_model import LogisticRegression
# Import a Multi-layer Perceptron classifier
from sklearn.neural_network import MLPClassifier
# Import a LightGBM classifier
import lightgbm as lgb
# In order to create CSV files
import csv
# Load the main data tables
application_train_data = pd.read_csv("data/application_train.csv")
application_test_data = pd.read_csv("data/application_test.csv")
# Load the Bureau data table
bureau_data = pd.read_csv("data/bureau.csv")
# Step 1: Create lists of different feature types in the main data
# frame, based on how each type will need to be preprocessed.
# 1. All 18 categorical features needing one-hot encoding.
# Includes the 4 categorical features originally
# mis-identified as having been normalized:
# EMERGENCYSTATE_MODE, HOUSETYPE_MODE, WALLSMATERIAL_MODE,
# FONDKAPREMONT_MODE
cat_feat_need_one_hot = [
'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
'NAME_TYPE_SUITE', 'OCCUPATION_TYPE', 'EMERGENCYSTATE_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'FONDKAPREMONT_MODE'
]
# 2. All 32 binary categorical features already one-hot encoded.
bin_cat_feat = [
'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'
]
# 3. All 2 non-normalized numerical features with skewed distributions
# and negative values. These features will need to have their
# distributions translated to positive ranges before being
# log-transformed, and then later scaled to the range [0,1].
non_norm_feat_neg_values_skewed = [
'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE'
]
# 4. All 15 non-normalized numerical features with skewed distributions,
# and only positive values. These features will need to be
# log-transformed, and eventually scaled to the range [0,1].
non_norm_feat_pos_values_skewed = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
'AMT_GOODS_PRICE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'OWN_CAR_AGE'
]
# 5. All 4 numerical features with normal shapes but needing to be scaled
# to the range [0,1].
norm_feat_need_scaling = [
'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START',
'REGION_POPULATION_RELATIVE'
]
# 6. All 46 numerical features that have been normalized to the range
# [0,1]. These features will need neither log-transformation, nor
# any further scaling.
norm_feat_not_need_scaling = [
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_AVG',
'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'LIVINGAREA_AVG',
'LIVINGAREA_MODE', 'LIVINGAREA_MEDI', 'ENTRANCES_AVG',
'ENTRANCES_MODE', 'ENTRANCES_MEDI', 'APARTMENTS_AVG',
'APARTMENTS_MODE', 'APARTMENTS_MEDI', 'ELEVATORS_AVG',
'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'NONLIVINGAREA_AVG',
'NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI', 'EXT_SOURCE_1',
'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI',
'LANDAREA_AVG', 'LANDAREA_MODE', 'LANDAREA_MEDI',
'YEARS_BUILD_AVG', 'YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI',
'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI',
'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_MEDI',
'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MEDI',
'COMMONAREA_AVG', 'COMMONAREA_MODE', 'COMMONAREA_MEDI',
'TOTALAREA_MODE'
]
# 7. The remaining 3 features in the main data frame that will be
# re-engineered and transformed into different features
feat_to_be_reengineered = [
'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'DAYS_EMPLOYED'
]
# Verify that all 120 features in the main data frame have been categorized
# according to how they will be preprocessed.
count_of_categorized_features = len(cat_feat_need_one_hot) + len(bin_cat_feat) + len(non_norm_feat_neg_values_skewed)\
+ len(non_norm_feat_pos_values_skewed) + len(norm_feat_need_scaling) + len(norm_feat_not_need_scaling) + len(feat_to_be_reengineered)
print('Number of features in main data frame that have been categorized: {}. Expected: 120.'.format(count_of_categorized_features))
#Step 2: Separate target data from training dataset.
targets = application_train_data['TARGET']
features_raw = application_train_data.drop('TARGET', axis = 1)
# Step 3: Use train_test_split from sklearn.model_selection to
# create a validation set that is 20% of the size of the total training set:
# Will allow me to compare performance of various learning algorithms without
# overfitting to the training data.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(features_raw,
targets,
test_size = 0.2,
random_state = 42)
# Step 4: Use the CNT_CHILDREN feature to engineer a binary
# categorical feature called HAS_CHILDREN. If value of CNT_CHILDREN is
# greater than 0, the value of HAS_CHILDREN will be 1. If value of CNT_CHILDREN is
# 0, value of HAS_CHILDREN will be 0.
CNT_CHILDREN_train = X_train_raw['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_train.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# Step 5: Drop the CNT_CHILDREN column from the main dataframe
X_train_raw = X_train_raw.drop('CNT_CHILDREN', axis=1)
# Add the new HAS_CHILDREN feature to the list of binary categorical
# features that are already one-hot encoded. There are now 33 such features.
bin_cat_feat = bin_cat_feat + ['HAS_CHILDREN']
# Step 6. Use the CNT_FAM_MEMBERS feature to engineer a categorical feature called NUMBER_FAMILY_MEMBERS.
# If CNT_FAM_MEMBERS is 1.0, then the value of NUMBER_FAMILY_MEMBERS will be 'one'. If CNT_FAM_MEMBERS is 2.0,
# then NUMBER_FAMILY_MEMBERS will be 'two'. If CNT_FAM_MEMBERS is 3.0 or greater, then NUMBER_FAMILY_MEMBERS will
# be 'three_plus'.
CNT_FAM_MEMBERS_train = X_train_raw['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_train.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
X_train_raw = X_train_raw.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# Step 7. Drop the CNT_FAM_MEMBERS feature from the main dataframe
X_train_raw = X_train_raw.drop('CNT_FAM_MEMBERS', axis=1)
# Add the new NUMBER_FAMILY_MEMBERS feature to the list of categorical
# features that will need to be one-hot encoded. There are now 19 of these features.
cat_feat_need_one_hot = cat_feat_need_one_hot + ['NUMBER_FAMILY_MEMBERS']
# Step 8. Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
# categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
# particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
# HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
# borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
# Filter the bureau data table for loans which are overdue (have a value
# for CREDIT_DAY_OVERDUE that's greater than 0)
bureau_data_filtered_for_overdue = bureau_data[bureau_data['CREDIT_DAY_OVERDUE'] > 0]
def build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(dataframe):
"""
Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
Parameters:
dataframe: Pandas dataframe containing a training or testing dataset
Returns: The dataframe with HAS_CREDIT_BUREAU_LOANS_OVERDUE feature appended to it.
"""
# Create a series called HAS_CREDIT_BUREAU_LOANS_OVERDUE and fill it with zeros.
# Its index is identical to that of the main dataframe. It will eventually be appended
# to the main data frame as a column.
HAS_CREDIT_BUREAU_LOANS_OVERDUE = pd.Series(data=0, index = dataframe['SK_ID_CURR'].index)
# A list of all the borrowers IDs in the main dataframe
main_data_table_borrower_IDs = dataframe['SK_ID_CURR'].values
# For each loan in the bureau data table that is overdue
# (has a value for CREDIT_DAY_OVERDUE that's greater than 0)
for index, row in bureau_data_filtered_for_overdue.iterrows():
# The borrower ID (SK_ID_CURR) that owns the overdue loan
borrower_ID = row['SK_ID_CURR']
# If the borrower ID owning the overdue loan is also
# in the main data frame, then enter a value of 1 in
# the series HAS_CREDIT_BUREAU_LOANS_OVERDUE at an index
# that is identical to the index of the borrower ID
# in the main data frame.
if borrower_ID in main_data_table_borrower_IDs:
# The index of the borrower's row in the main data table.
borrower_index_main_data_table = dataframe.index[dataframe['SK_ID_CURR'] == borrower_ID].tolist()[0]
# Place a value of 1 at the index of the series HAS_CREDIT_BUREAU_LOANS_OVERDUE
# which corresponds to the index of the borrower's ID in the main data table.
HAS_CREDIT_BUREAU_LOANS_OVERDUE.loc[borrower_index_main_data_table] = 1
# Append the newly engineered HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the main dataframe.
dataframe = dataframe.assign(HAS_CREDIT_BUREAU_LOANS_OVERDUE=HAS_CREDIT_BUREAU_LOANS_OVERDUE.values)
return dataframe
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
X_train_raw = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(X_train_raw)
# Add the new HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the list of binary categorical
# features. There are now 34 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_CREDIT_BUREAU_LOANS_OVERDUE']
# Find out what fraction of borrowers in the dataframe have overdue
# loans from other lenders besides Home Credit.
num_borrowers_maindataframe = X_train_raw.shape[0]
num_borrowers_maindataframe_with_other_overdue_loans = X_train_raw[X_train_raw['HAS_CREDIT_BUREAU_LOANS_OVERDUE'] == 1].shape[0]
percent_borrowers_with_other_overdue_loans = round(num_borrowers_maindataframe_with_other_overdue_loans*100./num_borrowers_maindataframe, 2)
print('{} borrowers, or {}% of the training segment\'s {} borrowers have overdue loans from other lenders.'.format(num_borrowers_maindataframe_with_other_overdue_loans, percent_borrowers_with_other_overdue_loans, num_borrowers_maindataframe))
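The row-by-row iterrows() loop in build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE works, but pandas can express the same mapping vectorially with isin(). A sketch of an equivalent approach on toy stand-in tables (same column names as the real data assumed, values made up):

```python
import pandas as pd

# Toy stand-ins for bureau.csv and the main dataframe (illustrative values only)
bureau_toy = pd.DataFrame({
    'SK_ID_CURR': [100001, 100002, 100002, 100003],
    'CREDIT_DAY_OVERDUE': [0, 5, 0, 0],
})
main_toy = pd.DataFrame({'SK_ID_CURR': [100001, 100002, 100004]})

# Borrower IDs that own at least one overdue bureau loan
overdue_ids = set(bureau_toy.loc[bureau_toy['CREDIT_DAY_OVERDUE'] > 0, 'SK_ID_CURR'])

# Vectorized membership test replaces the per-row lookup loop
main_toy['HAS_CREDIT_BUREAU_LOANS_OVERDUE'] = main_toy['SK_ID_CURR'].isin(overdue_ids).astype(int)

print(main_toy['HAS_CREDIT_BUREAU_LOANS_OVERDUE'].tolist())  # [0, 1, 0]
```

On a bureau table with over a million rows, avoiding the Python-level loop makes this feature construction dramatically faster.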
# Step 9. Use the DAYS_EMPLOYED feature to engineer a binary categorical feature called HAS_JOB.
# If the value of DAYS_EMPLOYED is 0 or less, then HAS_JOB will be 1. Otherwise, HAS_JOB will
# be 0. This condition will apply to all borrowers who had a value of 365243 for DAYS_EMPLOYED,
# which I hypothesized can be best interpreted as meaning that a borrower does not have a job.
DAYS_EMPLOYED_train = X_train_raw['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_train.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_JOB=HAS_JOB.values)
# Step 10. Drop the DAYS_EMPLOYED feature from the main dataframe
X_train_raw = X_train_raw.drop('DAYS_EMPLOYED', axis=1)
# Add the new HAS_JOB feature to the list of binary categorical features.
# There are now 35 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_JOB']
# Step 11. Translate the 2 non-normalized numerical features that have skewed distributions
# and negative values: DAYS_REGISTRATION, and DAYS_LAST_PHONE_CHANGE
def translate_negative_valued_features(dataframe, feature_name_list):
"""
Translate a dataset's continuous features containing several negative
values. The dataframe is modified such that all values of each feature
listed in the feature_name_list parameter become positive.
Parameters:
dataframe: Pandas dataframe containing the features
feature_name_list: List of strings, containing the names
of each feature whose values will be
translated
"""
for feature in feature_name_list:
# The minimum, most-negative, value of the feature
feature_min_value = dataframe[feature].min()
# Translate each value of the feature in a positive direction,
# of magnitude that's equal to the feature's most negative value.
dataframe[feature] = dataframe[feature].apply(lambda x: x - feature_min_value)
# Translate the above two negatively-valued features to positive values
translate_negative_valued_features(X_train_raw, non_norm_feat_neg_values_skewed)
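As a quick sanity check, the translation can be verified on toy data (hypothetical values mimicking a negative-valued DAYS_* feature):

```python
import pandas as pd

# Toy frame mimicking a skewed, negative-valued DAYS_* feature
toy = pd.DataFrame({'DAYS_REGISTRATION': [-4000.0, -250.0, -10.0, 0.0]})

# Shift every value up by the feature's minimum so the range starts at 0
min_value = toy['DAYS_REGISTRATION'].min()
toy['DAYS_REGISTRATION'] = toy['DAYS_REGISTRATION'] - min_value

print(toy['DAYS_REGISTRATION'].tolist())  # [0.0, 3750.0, 3990.0, 4000.0]
```

After this shift, every value is non-negative, so the log-transform in the next step is well-defined.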
# Step 12. Log-transform all 17 non-normalized numerical features that have skewed distributions.
# These 17 features include the 2 that were translated to positive ranges in Step 11.
# Add the 2 features translated to positive ranges above in Step 11 to
# the list of non-normalized skewed features with positive values. This is
# the set of features that will be log-transformed
log_transform_feats = non_norm_feat_pos_values_skewed + non_norm_feat_neg_values_skewed
X_train_raw[log_transform_feats] = X_train_raw[log_transform_feats].apply(lambda x: np.log(x + 1))
# Step 13. Replace 'NaN' values for all numerical features with each feature's mean. Fit an imputer
# to each numerical feature containing at least one 'NaN' entry.
# Create a list of all the 67 numerical features in the main dataframe. These include all
# 17 features that were log-transformed in Step 12, as well as the 4 normal features that
# still need to be scaled, as well as the 46 normal features that don't need scaling.
numerical_features = log_transform_feats + norm_feat_need_scaling + norm_feat_not_need_scaling
# Create a list of all numerical features in the training set that have at least one 'NaN' entry
numerical_features_with_nan = X_train_raw[numerical_features].columns[X_train_raw[numerical_features].isna().any()].tolist()
# Create an imputer
imputer = Imputer()
# Fit the imputer to each numerical feature in the training set that has 'NaN' values,
# and replace each 'NaN' entry of each feature with that feature's mean.
X_train_raw[numerical_features_with_nan] = imputer.fit_transform(X_train_raw[numerical_features_with_nan])
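Note that sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; on newer versions, the equivalent mean imputation would use SimpleImputer. A minimal sketch on a toy column (hypothetical values):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical column with a missing entry
toy = pd.DataFrame({'AMT_ANNUITY': [10.0, np.nan, 30.0]})

# Mean imputation, matching the behavior of the old Imputer() default
imputer = SimpleImputer(strategy='mean')
toy[['AMT_ANNUITY']] = imputer.fit_transform(toy[['AMT_ANNUITY']])

print(toy['AMT_ANNUITY'].tolist())  # [10.0, 20.0, 30.0]
```

The fit/transform interface is identical, so swapping it into the pipeline above would be a one-line change.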
# Step 14. Remove the borrower ID column, SK_ID_CURR, from the main dataframe
X_train_raw = X_train_raw.drop('SK_ID_CURR', axis=1)
# Verify that the main training dataframe has the expected number of columns.
# Dataframe initially had 122 columns.
# 4 features have been added (HAS_CHILDREN, NUMBER_FAMILY_MEMBERS, HAS_CREDIT_BUREAU_LOANS_OVERDUE, HAS_JOB).
# 5 columns have been removed (TARGET, SK_ID_CURR, DAYS_EMPLOYED, CNT_CHILDREN, CNT_FAM_MEMBERS).
# Expected number of columns is thus 121.
print('The main training dataframe now has {} columns. Expected: 121.'.format(X_train_raw.shape[1]))
# Step 15. One-hot encode all 19 non-binary categorical features.
X_train_raw = pd.get_dummies(X_train_raw, columns=cat_feat_need_one_hot)
# Create a list that includes only the newly one-hot encoded features
# as well as all the categorical features that were already binary.
all_bin_cat_feat = X_train_raw.columns.tolist()
for column_name in X_train_raw[numerical_features].columns.tolist():
all_bin_cat_feat.remove(column_name)
# Observe how many binary features now exist in the dataframe after one-hot encoding the
# 19 non-binary categorical features.
print('After one-hot encoding, there are now {} binary features in the main training dataframe.'.format(len(all_bin_cat_feat)))
# Observe how many total columns now exist in the dataframe after one-hot encoding.
# It is expected there are 184 binary categorical features, and 67 scaled numerical features
# for a total of 251 features.
print('After one-hot encoding, there are now {} columns in the main training dataframe. Expected: 251.'.format(X_train_raw.shape[1]))
# Step 16. Replace all 'NaN' values in all binary categorical features with 0.
# Create a list of binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan = X_train_raw[all_bin_cat_feat].columns[X_train_raw[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
X_train_raw[bin_cat_feat_with_nan] = X_train_raw[bin_cat_feat_with_nan].fillna(value=0)
# Step 17. Fit a min-max scaler to each of the 17 log-transformed numerical features, as well
# as to the 4 features DAYS_BIRTH, DAYS_ID_PUBLISH, HOUR_APPR_PROCESS_START, and the normalized
# feature REGION_POPULATION_RELATIVE. Each feature will be scaled to a range [0.0, 1.0].
# Build a list of all 21 features needing scaling. Add the list of features that
# were log-normalized above in Step 12 to the list of normally shaped features
# that need to be scaled to the range [0,1].
feats_to_scale = norm_feat_need_scaling + log_transform_feats
# Initialize a scaler with the default range of [0,1]
scaler = MinMaxScaler()
# Fit the scaler to each of the features of the train set that need to be scaled,
# then transform each of these features' values to the new scale.
X_train_raw[feats_to_scale] = scaler.fit_transform(X_train_raw[feats_to_scale])
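# A minimal sketch (made-up numbers) of why the scaler is fit only once, on the
# training data: transform on later data reuses the training min/max, so
# identical raw values map to identical scaled values, and test values outside
# the training range can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_vals = np.array([[0.0], [5.0], [10.0]])  # hypothetical training feature
test_vals = np.array([[5.0], [20.0]])          # hypothetical test feature

sc = MinMaxScaler()
sc.fit(train_vals)                     # learns min=0, max=10 from the training data only
scaled_test = sc.transform(test_vals)  # reuses the training min/max: 5 -> 0.5, 20 -> 2.0
```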
# Rename the dataframe to indicate that its columns have been fully preprocessed.
X_train_final = X_train_raw
# Indicate that training set preprocessing is done.
# Verify that the training dataframe has the expected number of columns.
# It is expected there are 184 binary categorical features,
# and 67 numerical features for a total of 251 features.
print('Training set preprocessing complete. The final training dataframe now has {} columns. Expected: 251.'.format(X_train_final.shape[1]))
# Step 18. Build a data preprocessing pipeline to be used for all testing sets.
# This pipeline will recreate all features that were engineered in the
# training set during the original data preprocessing phase.
# The pipeline will also apply the min-max scaling transforms
# originally fit on features in the training set to all datapoints in a
# testing set.
def adjust_columns_application_test_csv_table(testing_dataframe):
"""
    After it is one-hot encoded, the application_test.csv data table will have one
    extra column, 'REGION_RATING_CLIENT_W_CITY_-1', that is not present in the
    training dataframe. In that case, this column is removed from the testing
    data table. Only 1 of the 48,744 rows in application_test.csv has a value of
    1 for this feature after one-hot encoding, so dropping it should not
    meaningfully affect predictions.
    Additionally, unlike the test validation set, which originally comprised 20% of
    application_train.csv, application_test.csv will be missing the following columns
    after it is one-hot encoded:
    'CODE_GENDER_XNA', 'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown'
    In this case, we need to insert these columns into the testing dataframe at
    the exact same indices they occupy in the fully preprocessed training
    dataframe. Each inserted column is filled with zeros. (Since these binary
    features are absent from the application_test.csv data table, every borrower
    in that table would have had a 0 for each feature were it present.)
Parameters:
testing_dataframe: Pandas dataframe containing the testing dataset
contained in the file application_test.csv
Returns: a testing dataframe containing the exact same columns and
column order as found in the training dataframe
"""
# Identify any columns in the one-hot encoded testing_dataframe that
# are not in X_train_raw. These columns will need to be removed from the
# testing_dataframe. (Expected that there will only be one such
# column: 'REGION_RATING_CLIENT_W_CITY_-1')
X_train_columns_list = X_train_raw.columns.tolist()
testing_dataframe_columns_list = testing_dataframe.columns.tolist()
for column_name in X_train_columns_list:
if column_name in testing_dataframe_columns_list:
testing_dataframe_columns_list.remove(column_name)
columns_not_in_X_train_raw = testing_dataframe_columns_list
# Drop any column from the testing_dataframe that is not in the
# training dataframe. Expected to only be the one column 'REGION_RATING_CLIENT_W_CITY_-1'
for column in columns_not_in_X_train_raw:
testing_dataframe = testing_dataframe.drop(column, axis=1)
    # Get the column indices of each of the features 'CODE_GENDER_XNA',
    # 'NAME_INCOME_TYPE_Maternity leave', and 'NAME_FAMILY_STATUS_Unknown' from
    # the raw training dataframe (X_train_raw) prior to any PCA being run on it.
loc_code_gender_training_frame = X_train_raw.columns.get_loc('CODE_GENDER_XNA')
loc_name_income_type_maternity_leave_training_frame = X_train_raw.columns.get_loc('NAME_INCOME_TYPE_Maternity leave')
loc_name_family_status_unknown_training_frame = X_train_raw.columns.get_loc('NAME_FAMILY_STATUS_Unknown')
# Insert each column into the testing dataframe at the same index it had
# in the X_train_raw dataframe before PCA was run. Fill each column with all 0s.
# Order is important. 'CODE_GENDER_XNA' should be inserted first, followed by
# 'NAME_INCOME_TYPE_Maternity leave', and then finally 'NAME_FAMILY_STATUS_Unknown'.
testing_dataframe.insert(loc=loc_code_gender_training_frame, column='CODE_GENDER_XNA', value=0)
testing_dataframe.insert(loc=loc_name_income_type_maternity_leave_training_frame, column='NAME_INCOME_TYPE_Maternity leave', value=0)
testing_dataframe.insert(loc=loc_name_family_status_unknown_training_frame, column='NAME_FAMILY_STATUS_Unknown', value=0)
return testing_dataframe
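# The drop-and-insert logic above can also be expressed in one call with
# DataFrame.reindex, which drops extra columns, inserts missing ones with a
# constant fill, and enforces the training column order. A sketch on toy frames
# (illustrative column names only):

```python
import pandas as pd

train_cols = ['A', 'B', 'C']  # stand-in for X_train_raw.columns
test_df = pd.DataFrame({'A': [1], 'C': [3], 'EXTRA': [9]})

# Drop 'EXTRA', insert 'B' filled with zeros, and match the training order.
aligned = test_df.reindex(columns=train_cols, fill_value=0)
```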
def test_set_preprocessing_pipeline(testing_dataframe):
"""
    Recreate all features that were engineered in the training set during
    the original data preprocessing phase. The pipeline also applies an
    imputer to the test data table to fill 'NaN' values in numerical
    features; each binary feature's 'NaN' values are filled with 0. The
    min-max scaler fit on features in the training set is applied to the
    numerical features in the testing set.
Parameters:
testing_dataframe: Pandas dataframe containing a testing dataset
Returns: a fully preprocessed testing dataframe
"""
# Create the HAS_CHILDREN feature.
CNT_CHILDREN_test = testing_dataframe['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_test.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# Drop the CNT_CHILDREN column from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_CHILDREN', axis=1)
# Create the NUMBER_FAMILY_MEMBERS feature.
CNT_FAM_MEMBERS_test = testing_dataframe['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_test.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
    # Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# Drop the CNT_FAM_MEMBERS feature from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_FAM_MEMBERS', axis=1)
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
testing_dataframe = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(testing_dataframe)
# Create the HAS_JOB feature
DAYS_EMPLOYED_test = testing_dataframe['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_test.map(lambda x: 1 if x <= 0 else 0)
    # Append the newly engineered HAS_JOB feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_JOB=HAS_JOB.values)
# Drop the DAYS_EMPLOYED feature from the main dataframe
testing_dataframe = testing_dataframe.drop('DAYS_EMPLOYED', axis=1)
# Translate the two negatively-valued features DAYS_REGISTRATION, and
# DAYS_LAST_PHONE_CHANGE to positive values
translate_negative_valued_features(testing_dataframe, non_norm_feat_neg_values_skewed)
# Log-transform all 17 non-normalized numerical features that have skewed distributions.
testing_dataframe[log_transform_feats] = testing_dataframe[log_transform_feats].apply(lambda x: np.log(x + 1))
    # Replace 'NaN' values in the numerical features with the training set's
    # feature means, using the imputer that was fit in Step 13. (Reusing the
    # training-set imputer, rather than refitting on the test data, keeps
    # test-set statistics from leaking into preprocessing.)
    testing_dataframe[numerical_features_with_nan] = imputer.transform(testing_dataframe[numerical_features_with_nan])
# Remove the borrower ID column, SK_ID_CURR, from the main dataframe
testing_dataframe = testing_dataframe.drop('SK_ID_CURR', axis=1)
# One-hot encode all 19 non-binary categorical features.
testing_dataframe = pd.get_dummies(testing_dataframe, columns=cat_feat_need_one_hot)
# After one-hot encoding, the testing dataframe from application_test.csv will be
# missing 2 columns that are in the training dataframe. It will also have an extra
# column that was not in the training dataframe, giving it 249 total columns.
    # If this is the case, we need to modify this testing dataframe so that its
    # columns and column order are consistent with the training dataframe.
if testing_dataframe.shape[1] == 249:
testing_dataframe = adjust_columns_application_test_csv_table(testing_dataframe)
# Create a list of the binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan_testing = testing_dataframe[all_bin_cat_feat].columns[testing_dataframe[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
testing_dataframe[bin_cat_feat_with_nan_testing] = testing_dataframe[bin_cat_feat_with_nan_testing].fillna(value=0)
# Transform each of the 21 features that need to be scaled to the range [0,1] using
# the min-max scaler fit on the training set.
testing_dataframe[feats_to_scale] = scaler.transform(testing_dataframe[feats_to_scale])
return testing_dataframe
# Step 19. Preprocess the test validation set.
X_test_final = test_set_preprocessing_pipeline(X_test_raw)
# Verify that the test validation dataframe has the expected number of columns after
# preprocessing its data. It is expected there are 184 binary categorical features,
# and 67 numerical features for a total of 251 features.
print('Test validation set preprocessing complete. The final test validation dataframe now has {} columns. Expected: 251.'.format(X_test_final.shape[1]))
# Lists of probability predictions and classifier names.
# To be used to plot ROC curves of each classifier's prediction
# probabilities.
y_score_list = []
clf_label_list = []
# Step 1. Create an ROC area-under-curve scorer.
def roc_auc_scorer(y_targets, y_score):
"""
Calculates and returns the area under the ROC curve between
the true target values and the probability estimates of the
predicted values.
"""
    # Calculate the performance score between 'y_targets' and 'y_score'
    score = roc_auc_score(y_targets, y_score)
# Return the score
return score
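# A quick sanity check of the scorer on made-up labels and scores: ROC AUC is
# the fraction of (positive, negative) pairs whose scores are ranked correctly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
# 3 of the 4 positive/negative score pairs are ranked correctly, so AUC = 0.75.
auc = roc_auc_score(y_true, y_prob)
```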
# Step 2. Use the Gaussian Naive Bayes classifier to make predictions on
# the test validation set. Calculate the area under ROC curve score of
# these predictions.
# Fit a Gaussian Naive Bayes classifier to the training dataframe.
clf_naive_bayes = GaussianNB()
clf_naive_bayes.fit(X_train_final, y_train)
# The Naive Bayes estimates of probability of the positive class (TARGET=1):
# the probability estimate of each borrower making at least one late loan payment.
naive_bayes_y_score = clf_naive_bayes.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
naive_bayes_roc_auc_score = roc_auc_scorer(y_test, naive_bayes_y_score)
# Add the Naive Bayes classifier's scores to the results list.
y_score_list.append(naive_bayes_y_score)
clf_label_list.append('Naive Bayes All Features')
print('Naive Bayes (All Features) test validation set predictions\' ROC AUC score: {}'.format(naive_bayes_roc_auc_score))
# Step 3. Create a method that performs GridSearchCV on an
# AdaBoost classifier learning algorithm to discover the highest-scoring
# hyperparameter combination.
def find_best_hyperameters_adaboost(X_train, y_train):
"""
Performs grid search over the 'n_estimators' parameter of an AdaBoost
classifier trained on the input data [X_train, y_train].
"""
# Create an AdaBoost classifier object
clf = AdaBoostClassifier()
    # Create a dictionary of candidate values for the hyperparameters
    # 'learning_rate' and 'n_estimators' that will be tried.
    params = {
        'learning_rate': [0.01, 0.1, 1.0],
        'n_estimators': [200, 250, 500, 1000],
        'random_state': [42]
    }
# Transform 'roc_auc_scorer' into a scoring function using 'make_scorer'
scoring_fnc = make_scorer(roc_auc_scorer)
# Create a GridSearchCV object.
    grid = GridSearchCV(clf, params, scoring=scoring_fnc, cv=3)
# Fit the grid search object to the data to compute the optimal model
grid = grid.fit(X_train, y_train)
# Return the optimal model after fitting the data
return grid.best_estimator_
# The AdaBoost classifier with hyperparameter values that scored the best
# in GridSearchCV.
clf_AdaBoost = find_best_hyperameters_adaboost(X_train_final, y_train)
print('Highest scoring AdaBoost classifier after running GridSearchCV: {}'.format(clf_AdaBoost))
# Step 4. Use the AdaBoost classifier to make predictions on
# the test validation set. Calculate the area under ROC curve score of
# these predictions.
# The AdaBoost classifier's estimates of probability of the positive class (TARGET=1):
# the probability estimate of each borrower making at least one late loan payment.
adaBoost_y_score = clf_AdaBoost.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
adaBoost_roc_auc_score = roc_auc_scorer(y_test, adaBoost_y_score)
# Add the AdaBoost classifier's scores to the results list.
y_score_list.append(adaBoost_y_score)
clf_label_list.append('AdaBoost All Features')
print('AdaBoost (All Features) test validation set predictions\' ROC AUC score: {}'.format(adaBoost_roc_auc_score))
# Determine and display feature importances as determined by the AdaBoost classifier after
# it was fit on the full featureset.
feature_list = X_train_final.columns.values
feature_importance_list = clf_AdaBoost.feature_importances_
rows_list = []
for i in range(len(feature_importance_list)):
if feature_importance_list[i] > 0:
dictionary = {}
dictionary['Feature Name'] = feature_list[i]
dictionary['Importance'] = feature_importance_list[i]
rows_list.append(dictionary)
adaBoost_feature_importances = pd.DataFrame(rows_list, columns=['Feature Name', 'Importance'])
adaBoost_feature_importances = adaBoost_feature_importances.sort_values('Importance', ascending=False)
display(adaBoost_feature_importances)
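# The importance table above can also be built in one pandas expression; a
# sketch with toy stand-ins for the column names and the classifier's
# feature_importances_ array:

```python
import pandas as pd

feature_list = ['f1', 'f2', 'f3']            # stand-in for X_train_final.columns
feature_importance_list = [0.0, 0.7, 0.3]    # stand-in for clf.feature_importances_

importances = (
    pd.Series(feature_importance_list, index=feature_list, name='Importance')
    .loc[lambda s: s > 0]            # keep only features the model actually used
    .sort_values(ascending=False)    # most important first
    .rename_axis('Feature Name')
    .reset_index()
)
```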
# Step 5. Try using a Logistic Regression classifier to make predictions.
# Fit the classifier to the training data.
clf_logistic_regression = LogisticRegression(penalty='l2', random_state=42, solver='liblinear')
clf_logistic_regression.fit(X_train_final, y_train)
# The logistical regression classifier's estimates of probability of the positive
# class (TARGET=1): the probability estimate of each borrower making at least one
# late loan payment.
logistic_regression_y_score = clf_logistic_regression.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
logistic_regression_roc_auc_score = roc_auc_scorer(y_test, logistic_regression_y_score)
# Add the Logistic Regression classifier's scores to the results list.
y_score_list.append(logistic_regression_y_score)
clf_label_list.append('Logistic Regression All Features')
print('Logistic Regression (All Features) test validation set predictions\' ROC AUC score: {}'.format(logistic_regression_roc_auc_score))
# Step 6. Try using a Multi-layer Perceptron classifier to make predictions.
# Fit the classifier to the training data.
clf_mlp = MLPClassifier(
hidden_layer_sizes=100, activation='identity', solver='adam', alpha=0.001, batch_size=200,
learning_rate_init=0.001, random_state=42
)
clf_mlp.fit(X_train_final, y_train)
# The multi-layer perceptron classifier's estimates of probability of the positive
# class (TARGET=1): the probability estimate of each borrower making at least one
# late loan payment.
mlp_y_score = clf_mlp.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
mlp_roc_auc_score = roc_auc_scorer(y_test, mlp_y_score)
# Add the Multi-Layer Perceptron classifier's scores to the results list.
y_score_list.append(mlp_y_score)
clf_label_list.append('Multi-Layer Perceptron All Features')
print('Multi-layer Perceptron (All Features) test validation set predictions\' ROC AUC score: {}'.format(mlp_roc_auc_score))
# Step 7. Try using a LightGBM classifier.
# Convert preprocessed training dataset into LightGBM dataset format
lightgbm_training = lgb.Dataset(X_train_final, label=y_train)
# Specify parameters
params = {}
params['learning_rate'] = 0.01
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'auc'
params['sub_feature'] = 0.3
params['num_leaves'] = 100
params['min_data_in_leaf'] = 500
params['max_depth'] = 10
params['max_bin'] = 64
#params['min_data_in_bin'] = 3
#params['lambda_l1'] = 0.01
params['lambda_l2'] = 0.01
#params['min_gain_to_split'] = 0.01
params['bagging_freq'] = 100
params['bagging_fraction'] = 0.9
#params['feature_fraction'] = 0.5
# Fit the LightGBM classifier to the training data for 1500 boosting rounds
clf_lgb = lgb.train(params, lightgbm_training, num_boost_round=1500)
# Classifier's estimates of probability of the positive class (TARGET=1): the
# probability estimate of each borrower making at least one late loan payment.
lgb_y_score = clf_lgb.predict(X_test_final)
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
lgb_roc_auc_score = roc_auc_scorer(y_test, lgb_y_score)
# Add the LightGBM classifier's scores to the results list.
y_score_list.append(lgb_y_score)
clf_label_list.append('LightGBM All Features')
print('LightGBM (All Features) test validation set predictions\' ROC AUC score: {}'.format(lgb_roc_auc_score))
# Step 8. Build a prediction pipeline for the testing data table (application_test.csv) that
# saves prediction probabilities to a CSV file, which will then be submitted on Kaggle.
# def testing_data_table_predictions_to_csv(clf, testing_data_table, isLightGBM):
# """
# A prediction pipeline that:
# 1. Preprocesses the 48,744 row testing data table
# 2. Uses a classifier to compute estimates of the probability of the positive
# class (TARGET=1) for each borrower: the probability estimate of each borrower
# making at least one late loan payment.
# 3. Saves a CSV file that contains probabilities of target labels for each
# borrower (SK_ID_CURR) in the testing data table.
# Parameters:
# clf: A machine learning classifier object that has already been fit to
# the training data.
# testing_data_table: Pandas dataframe containing the testing dataset.
# isLightGBM: Boolean flag indicating whether the classifier is LightGBM.
# If True, predictions are made with 'predict' rather than 'predict_proba'.
# """
# # Get a list of the borrower IDs (SK_ID_CURR column). The borrower ID must be
# # placed in each row of CSV file that will be created.
# borrower_IDs = testing_data_table['SK_ID_CURR']
# # Preprocess the testing data table so that predictions can be made on it.
# X_test_final = test_set_preprocessing_pipeline(testing_data_table)
# #print('application_test.csv testing set processing complete. The processed dataframe now has {} columns. Expected: 251.'.format(X_test_final.shape[1]))
# # Classifier's estimates of probability of the positive class (TARGET=1): the
# # probability estimate of each borrower making at least one late loan payment.
# # If the classifier is LightGBM, the method for making predictions is merely 'predict'
# # and the array containing these probabilities has a slightly different shape than
# # those produced by the other classifiers.
# if isLightGBM:
# clf_y_score = clf.predict(X_test_final)
# else:
# clf_y_score = clf.predict_proba(X_test_final)[:, 1]
# # Create the CSV file that will be saved
# file_output = 'dellinger_kaggle_home_credit_submission2.csv'
# # Write to the CSV file
# with open(file_output, 'w') as csvfile:
# writer = csv.writer(csvfile)
# # Write the header row
# writer.writerow(['SK_ID_CURR','TARGET'])
# # Write a row for each borrower that contains the
# # prediction probability of their label.
# for index, value in borrower_IDs.iteritems():
# writer.writerow([value, clf_y_score[index]])
# # To submit to Kaggle: the LightGBM Classifier's predictions on full featureset.
# # Create predictions on the data in the testing data table (application_test.csv)
# # using the LightGBM classifier fit above in Step 7. Also create a CSV
# # file containing the prediction probabilities for each borrower ID (SK_ID_CURR)
# # in the testing data table.
# testing_data_table_predictions_to_csv(clf_lgb, application_test_data, True)
# Step 1. Now try training all classifiers on a featureset where PCA is used to compress the
# dimensions of the 67 numerical features.
# Load the main data tables
application_train_data = pd.read_csv("data/application_train.csv")
application_test_data = pd.read_csv("data/application_test.csv")
# Load the Bureau data table
bureau_data = pd.read_csv("data/bureau.csv")
# 1: Create lists of different feature types in the main data
# frame, based on how each type will need to be preprocessed.
# i. All 18 categorical features needing one-hot encoding.
# Includes the 4 categorical features originally
# mis-identified as having been normalized:
# EMERGENCYSTATE_MODE, HOUSETYPE_MODE, WALLSMATERIAL_MODE,
# FONDKAPREMONT_MODE
cat_feat_need_one_hot = [
'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
'NAME_TYPE_SUITE', 'OCCUPATION_TYPE', 'EMERGENCYSTATE_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'FONDKAPREMONT_MODE'
]
# ii. All 32 binary categorical features already one-hot encoded.
bin_cat_feat = [
'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'
]
# iii. All 2 non-normalized numerical features with skewed distributions
# and negative values. These features will need to have their
# distributions translated to positive ranges before being
# log-transformed, and then later scaled to the range [0,1].
non_norm_feat_neg_values_skewed = [
'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE'
]
# iv. All 15 non-normalized numerical features with skewed distributions,
# and only positive values. These features will need to be
# log-transformed, and eventually scaled to the range [0,1].
non_norm_feat_pos_values_skewed = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
'AMT_GOODS_PRICE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'OWN_CAR_AGE'
]
# v. All 4 numerical features with normal shapes but needing to be scaled
# to the range [0,1].
norm_feat_need_scaling = [
'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START',
'REGION_POPULATION_RELATIVE'
]
# vi. All 46 numerical features that have been normalized to the range
# [0,1]. These features will need neither log-transformation, nor
# any further scaling.
norm_feat_not_need_scaling = [
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_AVG',
'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'LIVINGAREA_AVG',
'LIVINGAREA_MODE', 'LIVINGAREA_MEDI', 'ENTRANCES_AVG',
'ENTRANCES_MODE', 'ENTRANCES_MEDI', 'APARTMENTS_AVG',
'APARTMENTS_MODE', 'APARTMENTS_MEDI', 'ELEVATORS_AVG',
'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'NONLIVINGAREA_AVG',
'NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI', 'EXT_SOURCE_1',
'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI',
'LANDAREA_AVG', 'LANDAREA_MODE', 'LANDAREA_MEDI',
'YEARS_BUILD_AVG', 'YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI',
'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI',
'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_MEDI',
'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MEDI',
'COMMONAREA_AVG', 'COMMONAREA_MODE', 'COMMONAREA_MEDI',
'TOTALAREA_MODE'
]
# vii. The remaining 3 features in the main data frame that will be
# re-engineered and transformed into different features
feat_to_be_reengineered = [
'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'DAYS_EMPLOYED'
]
# 2: Separate target data from training dataset.
targets = application_train_data['TARGET']
features_raw = application_train_data.drop('TARGET', axis = 1)
# 3: Use train_test_split from sklearn.cross_validation to
# create a test validation set that is 20% of the total training set.
# This allows comparing the performance of various learning algorithms
# without overfitting to the training data.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(features_raw,
targets,
test_size = 0.2,
random_state = 42)
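# If the TARGET classes are imbalanced (typical for loan-default data, though
# not verified here), passing stratify=targets would keep the same positive
# rate in both halves of the split. A toy sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)   # 100 hypothetical datapoints
y_toy = np.array([0] * 90 + [1] * 10)   # 10% positive class

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Both halves keep the 10% positive rate: 8 positives in train, 2 in validation.
```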
# 4: Use the CNT_CHILDREN feature to engineer a binary
# categorical feature called HAS_CHILDREN. If value of CNT_CHILDREN is
# greater than 0, the value of HAS_CHILDREN will be 1. If value of CNT_CHILDREN is
# 0, value of HAS_CHILDREN will be 0.
CNT_CHILDREN_train = X_train_raw['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_train.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# 5: Drop the CNT_CHILDREN column from the main dataframe
X_train_raw = X_train_raw.drop('CNT_CHILDREN', axis=1)
# Add the new HAS_CHILDREN feature to the list of binary categorical
# features that are already one-hot encoded. There are now 33 such features.
bin_cat_feat = bin_cat_feat + ['HAS_CHILDREN']
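# The map-based binarization in Step 4 behaves like this on toy values;
# (series > 0).astype(int) is an equivalent vectorized form.

```python
import pandas as pd

cnt_children = pd.Series([0, 2, 1, 0])  # hypothetical CNT_CHILDREN values
has_children = cnt_children.map(lambda x: 1 if x > 0 else 0)
# Equivalent vectorized form:
has_children_vec = (cnt_children > 0).astype(int)
```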
# 6. Use the CNT_FAM_MEMBERS feature to engineer a categorical feature called NUMBER_FAMILY_MEMBERS.
# If CNT_FAM_MEMBERS is 1.0, then the value of NUMBER_FAMILY_MEMBERS will be 'one'. If CNT_FAM_MEMBERS is 2.0,
# then NUMBER_FAMILY_MEMBERS will be 'two'. If CNT_FAM_MEMBERS is 3.0 or greater, then NUMBER_FAMILY_MEMBERS will
# be 'three_plus'.
CNT_FAM_MEMBERS_train = X_train_raw['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_train.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
X_train_raw = X_train_raw.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# 7. Drop the CNT_FAM_MEMBERS feature from the main dataframe
X_train_raw = X_train_raw.drop('CNT_FAM_MEMBERS', axis=1)
# Add the new NUMBER_FAMILY_MEMBERS feature to the list of categorical
# features that will need to be one-hot encoded. There are now 19 of these features.
cat_feat_need_one_hot = cat_feat_need_one_hot + ['NUMBER_FAMILY_MEMBERS']
# 8. Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
# categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
# particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
# HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
# borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
# Filter the bureau data table for loans which are overdue (have a value
# for CREDIT_DAY_OVERDUE that's greater than 0)
bureau_data_filtered_for_overdue = bureau_data[bureau_data['CREDIT_DAY_OVERDUE'] > 0]
def build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(dataframe):
"""
Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
Parameters:
dataframe: Pandas dataframe containing a training or testing dataset
Returns: The dataframe with HAS_CREDIT_BUREAU_LOANS_OVERDUE feature appended to it.
"""
    # Build the set of borrower IDs that have at least one overdue credit
    # bureau loan (CREDIT_DAY_OVERDUE > 0).
    overdue_borrower_IDs = set(bureau_data_filtered_for_overdue['SK_ID_CURR'].values)
    # Vectorized membership test against the main data table's borrower IDs:
    # 1 if the borrower has an overdue credit bureau loan, 0 otherwise. This
    # replaces a per-row loop over the filtered bureau table and produces a
    # series whose index is identical to that of the main dataframe.
    HAS_CREDIT_BUREAU_LOANS_OVERDUE = dataframe['SK_ID_CURR'].isin(overdue_borrower_IDs).astype(int)
# Append the newly engineered HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the main dataframe.
dataframe = dataframe.assign(HAS_CREDIT_BUREAU_LOANS_OVERDUE=HAS_CREDIT_BUREAU_LOANS_OVERDUE.values)
return dataframe
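# Building this feature amounts to a set-membership test on borrower IDs; a toy
# sketch (made-up IDs) of the idea:

```python
import pandas as pd

main_df = pd.DataFrame({'SK_ID_CURR': [101, 102, 103]})
overdue_ids = {102}  # stand-in for IDs with CREDIT_DAY_OVERDUE > 0 in bureau.csv

# 1 if the borrower ID appears among the overdue IDs, 0 otherwise.
main_df = main_df.assign(
    HAS_CREDIT_BUREAU_LOANS_OVERDUE=main_df['SK_ID_CURR'].isin(overdue_ids).astype(int)
)
```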
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
X_train_raw = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(X_train_raw)
# Add the new HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the list of binary categorical
# features. There are now 34 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_CREDIT_BUREAU_LOANS_OVERDUE']
# 9. Use the DAYS_EMPLOYED feature to engineer a binary categorical feature called HAS_JOB.
# If the value of DAYS_EMPLOYED is 0 or less, then HAS_JOB will be 1. Otherwise, HAS_JOB will
# be 0. In particular, every borrower with the sentinel value 365243 for DAYS_EMPLOYED
# receives HAS_JOB = 0; I hypothesized that this sentinel is best interpreted as meaning
# the borrower does not have a job.
DAYS_EMPLOYED_train = X_train_raw['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_train.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_JOB=HAS_JOB.values)
# 10. Drop the DAYS_EMPLOYED feature from the main dataframe
X_train_raw = X_train_raw.drop('DAYS_EMPLOYED', axis=1)
# Add the new HAS_JOB feature to the list of binary categorical features.
# There are now 35 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_JOB']
# 11. Translate the 2 non-normalized numerical features that have skewed distributions
# and negative values: DAYS_REGISTRATION, and DAYS_LAST_PHONE_CHANGE
def translate_negative_valued_features(dataframe, feature_name_list):
"""
Translate a dataset's continuous features containing several negative
values. The dataframe is modified such that all values of each feature
listed in the feature_name_list parameter become positive.
Parameters:
dataframe: Pandas dataframe containing the features
feature_name_list: List of strings, containing the names
of each feature whose values will be
translated
"""
for feature in feature_name_list:
        # The minimum (most negative) value of the feature
        feature_min_value = dataframe[feature].min()
        # Shift every value of the feature upward by the magnitude of the
        # feature's minimum, so the smallest value becomes 0.
        dataframe[feature] = dataframe[feature].apply(lambda x: x - feature_min_value)
# Translate the above two negatively-valued features to positive values
translate_negative_valued_features(X_train_raw, non_norm_feat_neg_values_skewed)
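The translation can be illustrated on a hypothetical negatively-valued column; subtracting the (negative) minimum shifts the smallest value to exactly 0:

```python
import pandas as pd

df = pd.DataFrame({'DAYS_REGISTRATION': [-4000.0, -100.0, -1.0]})
min_value = df['DAYS_REGISTRATION'].min()  # -4000.0
# Subtracting a negative minimum shifts every value into the non-negative range.
df['DAYS_REGISTRATION'] = df['DAYS_REGISTRATION'] - min_value
print(df['DAYS_REGISTRATION'].tolist())  # [0.0, 3900.0, 3999.0]
```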
# 12. Log-transform all 17 non-normalized numerical features that have skewed distributions.
# These 17 features include the 2 that were translated to positive ranges in Step 11.
# Add the 2 features translated to positive ranges above in Step 11 to
# the list of non-normalized skewed features with positive values. This is
# the set of features that will be log-transformed
log_transform_feats = non_norm_feat_pos_values_skewed + non_norm_feat_neg_values_skewed
X_train_raw[log_transform_feats] = X_train_raw[log_transform_feats].apply(lambda x: np.log(x + 1))
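Note that np.log(x + 1) is exactly numpy's np.log1p(x); a quick check on made-up skewed values confirms the equivalence and that a 0 maps to 0:

```python
import numpy as np
import pandas as pd

amounts = pd.Series([0.0, 9.0, 99.0, 999999.0])  # hypothetical skewed values
logged = np.log1p(amounts)  # identical to np.log(amounts + 1)
assert np.allclose(logged, np.log(amounts + 1))
assert logged.iloc[0] == 0.0  # zeros stay at zero after the transform
```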
# 13. Replace 'NaN' values for all numerical features with each feature's mean. Fit an imputer
# to each numerical feature containing at least one 'NaN' entry.
# Create a list of all the 67 numerical features in the main dataframe. These include all
# 17 features that were log-transformed in Step 12, as well as the 4 normal features that
# still need to be scaled, as well as the 46 normal features that don't need scaling.
numerical_features = log_transform_feats + norm_feat_need_scaling + norm_feat_not_need_scaling
# Create a list of all numerical features in the training set that have at least one 'NaN' entry
numerical_features_with_nan = X_train_raw[numerical_features].columns[X_train_raw[numerical_features].isna().any()].tolist()
# Create a mean-strategy imputer (the default for sklearn's Imputer; newer
# scikit-learn versions use SimpleImputer(strategy='mean') instead)
imputer = Imputer()
# Fit the imputer to each numerical feature in the training set that has 'NaN' values,
# and replace each 'NaN' entry of each feature with that feature's mean.
X_train_raw[numerical_features_with_nan] = imputer.fit_transform(X_train_raw[numerical_features_with_nan])
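For reference, sklearn's Imputer was later replaced by SimpleImputer; mean imputation itself works the same way. A sketch on toy data, not the notebook's actual columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'AMT_ANNUITY': [10.0, np.nan, 30.0]})
# strategy='mean' matches the old Imputer's default behavior
imp = SimpleImputer(strategy='mean')
df[['AMT_ANNUITY']] = imp.fit_transform(df[['AMT_ANNUITY']])
print(df['AMT_ANNUITY'].tolist())  # [10.0, 20.0, 30.0]
```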
# 14. Remove the borrower ID column, SK_ID_CURR, from the main dataframe
X_train_raw = X_train_raw.drop('SK_ID_CURR', axis=1)
# 15. One-hot encode all 19 non-binary categorical features.
X_train_raw = pd.get_dummies(X_train_raw, columns=cat_feat_need_one_hot)
# Create a list that includes only the newly one-hot encoded features
# as well as all the categorical features that were already binary.
all_bin_cat_feat = X_train_raw.columns.tolist()
for column_name in X_train_raw[numerical_features].columns.tolist():
all_bin_cat_feat.remove(column_name)
# 16. Replace all 'NaN' values in all binary categorical features with 0.
# Create a list of binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan = X_train_raw[all_bin_cat_feat].columns[X_train_raw[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
X_train_raw[bin_cat_feat_with_nan] = X_train_raw[bin_cat_feat_with_nan].fillna(value=0)
# 17. Fit a min-max scaler to each of the 17 log-transformed numerical features, as well
# as to the 4 features DAYS_BIRTH, DAYS_ID_PUBLISH, HOUR_APPR_PROCESS_START, and the normalized
# feature REGION_POPULATION_RELATIVE. Each feature will be scaled to a range [0.0, 1.0].
# Build a list of all 21 features needing scaling. Add the list of features that
# were log-normalized above in Step 12 to the list of normally shaped features
# that need to be scaled to the range [0,1].
feats_to_scale = norm_feat_need_scaling + log_transform_feats
# Initialize a scaler with the default range of [0,1]
scaler = MinMaxScaler()
# Fit the scaler to each of the features of the train set that need to be scaled,
# then transform each of these features' values to the new scale.
X_train_raw[feats_to_scale] = scaler.fit_transform(X_train_raw[feats_to_scale])
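The fit-on-train / transform-on-test discipline used here can be sketched with toy numbers; note that test values outside the training range map outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[25.0], [150.0]])
scaler.fit(train)  # learns min=0 and max=100 from the training data only
scaled = scaler.transform(test)
print(scaled.ravel().tolist())  # [0.25, 1.5] -- 150 exceeds the training max
```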
# Rename the dataframe to indicate that its columns have been fully preprocessed.
X_train_processed = X_train_raw
# 18. Fit PCA with 17 components on all numerical features and verify that these
# components together explain approximately 90% of the variance in the data.
pca = PCA(n_components = 17)
pca.fit(X_train_processed[numerical_features])
# Number of components used for pca
n_components = len(pca.explained_variance_ratio_)
# The total percent explained variance of all components used in PCA
percent_explained_var_all_n_components = round(sum(pca.explained_variance_ratio_)*100, 2)
print('Explained variance ratios for each component:')
print(pca.explained_variance_ratio_)
print('\r')
print('{}% of variance of numerical features explained by {} components.'.format(percent_explained_var_all_n_components, n_components))
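The number of components needed to reach a variance threshold can also be found programmatically from the cumulative explained-variance ratio, shown here on synthetic data rather than the notebook's features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Synthetic data with rapidly decaying variance across 10 columns
X = rng.randn(200, 10) * np.array([5, 4, 3, 2, 1, .5, .4, .3, .2, .1])
pca_demo = PCA().fit(X)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
# Smallest number of components explaining at least 90% of the variance
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(n_components_90, cumulative[n_components_90 - 1])
```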
# 19. Use PCA to reduce the dimension space of the numerical features
# to the optimal number of principal components discovered above.
reduced_numerical_data = pca.transform(X_train_processed[numerical_features])
# Create a DataFrame for the reduced data
X_train_reduced_numerical_features = pd.DataFrame(reduced_numerical_data, index=X_train_processed.index, columns = [
'PCA Dimension 1', 'PCA Dimension 2', 'PCA Dimension 3',
'PCA Dimension 4', 'PCA Dimension 5', 'PCA Dimension 6',
'PCA Dimension 7', 'PCA Dimension 8', 'PCA Dimension 9',
'PCA Dimension 10', 'PCA Dimension 11', 'PCA Dimension 12',
'PCA Dimension 13', 'PCA Dimension 14', 'PCA Dimension 15',
'PCA Dimension 16', 'PCA Dimension 17'
])
# Display the head of the dataframe created to store the values of the reduced
# numerical feature dimensions output by PCA.
display(X_train_reduced_numerical_features.head())
# 20. Drop all 67 numerical features from the original preprocessed
# dataframe, so that it only contains the 184 binary categorical features.
# Append the dataframe containing the reduced numerical features back to
# this original dataframe.
# Drop the 67 original numerical features from the dataframe
X_train_processed = X_train_processed.drop(numerical_features, axis=1)
# Merge the dataframe with the dataframe containing the 17 reduced
# numerical features
X_train_final = pd.merge(left=X_train_processed, right=X_train_reduced_numerical_features, left_index=True, right_index=True)
# 21. Build a data preprocessing pipeline to be used for all testing sets.
# This pipeline will recreate all features that were engineered in the
# training set during the original data preprocessing phase.
# The pipeline will also apply the imputer, min-max, and PCA transforms
# originally fit on features in the training set to all datapoints in a
# testing set.
def adjust_columns_application_test_csv_table(testing_dataframe):
"""
After it is one-hot encoded, application_test.csv data table will have one
extra column, 'REGION_RATING_CLIENT_W_CITY_-1', that is not present in the
training dataframe. This column will be removed from the testing datatable
in this case. Only 1 of the 48,744 rows in application_test.csv will have a
value of 1 for this feature following one-hot encoding. I am not worried
about this column's elimination from the testing dataframe affecting predictions.
Additionally, unlike the test validation set, which originally comprised 20% of
application_train.csv, application_test.csv will be missing the following columns
after it is one-hot encoded:
'CODE_GENDER_XNA', 'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown'
In this case, we need to insert these columns into the testing dataframe, at
the exact same indices they are located at in the fully preprocessed training
dataframe. Each inserted column will be filled with all zeros. (If these
binary features are missing from the application_test.csv data table, we can
infer that each borrower in that data table would have a 0 for each feature
were it present.)
Parameters:
testing_dataframe: Pandas dataframe containing the testing dataset
contained in the file application_test.csv
Returns: a testing dataframe containing the exact same columns and
column order as found in the training dataframe
"""
# Identify any columns in the one-hot encoded testing_dataframe that
# are not in X_train_raw. These columns will need to be removed from the
# testing_dataframe. (Expected that there will only be one such
# column: 'REGION_RATING_CLIENT_W_CITY_-1')
X_train_columns_list = X_train_raw.columns.tolist()
testing_dataframe_columns_list = testing_dataframe.columns.tolist()
for column_name in X_train_columns_list:
if column_name in testing_dataframe_columns_list:
testing_dataframe_columns_list.remove(column_name)
columns_not_in_X_train_raw = testing_dataframe_columns_list
# Drop any column from the testing_dataframe that is not in the
# training dataframe. Expected to only be the one column 'REGION_RATING_CLIENT_W_CITY_-1'
for column in columns_not_in_X_train_raw:
testing_dataframe = testing_dataframe.drop(column, axis=1)
# Get the column indices of each of the features 'CODE_GENDER_XNA',
#'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown' from
# the raw training dataframe (X_train_raw) prior to having PCA run on it.
loc_code_gender_training_frame = X_train_raw.columns.get_loc('CODE_GENDER_XNA')
loc_name_income_type_maternity_leave_training_frame = X_train_raw.columns.get_loc('NAME_INCOME_TYPE_Maternity leave')
loc_name_family_status_unknown_training_frame = X_train_raw.columns.get_loc('NAME_FAMILY_STATUS_Unknown')
# Insert each column into the testing dataframe at the same index it had
# in the X_train_raw dataframe before PCA was run. Fill each column with all 0s.
# Order is important. 'CODE_GENDER_XNA' should be inserted first, followed by
# 'NAME_INCOME_TYPE_Maternity leave', and then finally 'NAME_FAMILY_STATUS_Unknown'.
testing_dataframe.insert(loc=loc_code_gender_training_frame, column='CODE_GENDER_XNA', value=0)
testing_dataframe.insert(loc=loc_name_income_type_maternity_leave_training_frame, column='NAME_INCOME_TYPE_Maternity leave', value=0)
testing_dataframe.insert(loc=loc_name_family_status_unknown_training_frame, column='NAME_FAMILY_STATUS_Unknown', value=0)
return testing_dataframe
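A more general alternative worth knowing: pandas' DataFrame.reindex can align test columns to the training columns in a single call, dropping extras and zero-filling missing dummy columns (a sketch, not the exact logic above):

```python
import pandas as pd

train = pd.DataFrame({'A': [1], 'B': [0], 'C': [1]})
test = pd.DataFrame({'A': [1], 'C': [0], 'EXTRA': [1]})
# Drops 'EXTRA', inserts 'B' filled with 0, and matches the training column order.
aligned = test.reindex(columns=train.columns, fill_value=0)
print(aligned.columns.tolist())  # ['A', 'B', 'C']
```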
def test_set_preprocessing_pipeline(testing_dataframe):
"""
Recreate all features that were engineered in the training set during
the original data preprocessing phase. Missing numerical 'NaN' values
will be filled with an imputer. Missing binary categorical feature 'NaN'
values will be replaced with 0. The pipeline will also apply
the min-max and PCA transforms originally fit on features
in the training set to numerical features in the testing set.
Parameters:
testing_dataframe: Pandas dataframe containing a testing dataset
Returns: a fully preprocessed testing dataframe
"""
# Create the HAS_CHILDREN feature.
CNT_CHILDREN_test = testing_dataframe['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_test.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# Drop the CNT_CHILDREN column from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_CHILDREN', axis=1)
# Create the NUMBER_FAMILY_MEMBERS feature.
CNT_FAM_MEMBERS_test = testing_dataframe['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_test.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# Drop the CNT_FAM_MEMBERS feature from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_FAM_MEMBERS', axis=1)
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
testing_dataframe = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(testing_dataframe)
# Create the HAS_JOB feature
DAYS_EMPLOYED_test = testing_dataframe['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_test.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_JOB=HAS_JOB.values)
# Drop the DAYS_EMPLOYED feature from the main dataframe
testing_dataframe = testing_dataframe.drop('DAYS_EMPLOYED', axis=1)
# Translate the two negatively-valued features DAYS_REGISTRATION, and
# DAYS_LAST_PHONE_CHANGE to positive values
translate_negative_valued_features(testing_dataframe, non_norm_feat_neg_values_skewed)
# Log-transform all 17 non-normalized numerical features that have skewed distributions.
testing_dataframe[log_transform_feats] = testing_dataframe[log_transform_feats].apply(lambda x: np.log(x + 1))
# Use the imputer fit on the training set to replace 'NaN' values in the numerical
# features with the training set's feature means. Only transform here: fitting the
# imputer again on the test set would leak test-set statistics into preprocessing.
testing_dataframe[numerical_features_with_nan] = imputer.transform(testing_dataframe[numerical_features_with_nan])
# Remove the borrower ID column, SK_ID_CURR, from the main dataframe
testing_dataframe = testing_dataframe.drop('SK_ID_CURR', axis=1)
# One-hot encode all 19 non-binary categorical features.
testing_dataframe = pd.get_dummies(testing_dataframe, columns=cat_feat_need_one_hot)
# After one-hot encoding, the testing dataframe from application_test.csv will be
# missing 2 columns that are in the training dataframe. It will also have an extra
# column that was not in the training dataframe, giving it 249 total columns.
# If this is the case, we need to modify this testing dataframe so that its columns
# and column order are consistent with the training dataframe.
if testing_dataframe.shape[1] == 249:
testing_dataframe = adjust_columns_application_test_csv_table(testing_dataframe)
# Create a list of the binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan_testing = testing_dataframe[all_bin_cat_feat].columns[testing_dataframe[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
testing_dataframe[bin_cat_feat_with_nan_testing] = testing_dataframe[bin_cat_feat_with_nan_testing].fillna(value=0)
# Transform each of the 21 features that need to be scaled to the range [0,1] using
# the min-max scaler fit on the training set.
testing_dataframe[feats_to_scale] = scaler.transform(testing_dataframe[feats_to_scale])
# Use the PCA algorithm fit on the training set to reduce the dimension space of
# the numerical features in the testing set.
reduced_numerical_data_testing = pca.transform(testing_dataframe[numerical_features])
# Create a DataFrame for the reduced data
testing_dataframe_reduced_numerical_features = pd.DataFrame(reduced_numerical_data_testing, index=testing_dataframe.index, columns = [
'PCA Dimension 1', 'PCA Dimension 2', 'PCA Dimension 3',
'PCA Dimension 4', 'PCA Dimension 5', 'PCA Dimension 6',
'PCA Dimension 7', 'PCA Dimension 8', 'PCA Dimension 9',
'PCA Dimension 10', 'PCA Dimension 11', 'PCA Dimension 12',
'PCA Dimension 13', 'PCA Dimension 14', 'PCA Dimension 15',
'PCA Dimension 16', 'PCA Dimension 17'
])
# Drop the 67 original numerical features from the dataframe
testing_dataframe = testing_dataframe.drop(numerical_features, axis=1)
# Merge the dataframe with the dataframe containing the 17 reduced
# numerical features.
testing_dataframe = pd.merge(left=testing_dataframe, right=testing_dataframe_reduced_numerical_features, left_index=True, right_index=True)
# Return the fully preprocessed testing dataframe
return testing_dataframe
# 22. Preprocess the test validation set.
X_test_final = test_set_preprocessing_pipeline(X_test_raw)
# Verify that both the training and test validation dataframes have the expected number of
# columns after preprocessing and after transforming their numerical features with the PCA
# model that was fit on the training data. It is expected that there are 184 binary categorical
# features and 17 reduced numerical features, for a total of 201 features.
print('Training set preprocessing complete. The final training dataframe now has {} columns. Expected: 201.'.format(X_train_final.shape[1]))
print('Test validation set preprocessing complete. The final test validation dataframe now has {} columns. Expected: 201.'.format(X_test_final.shape[1]))
# Train the classifiers and compute prediction probabilities:
# 1. Use a Gaussian Naive Bayes classifier to make predictions on
# the test validation set. Calculate the area under ROC curve score of
# these predictions.
# Fit a Gaussian Naive Bayes classifier to the training dataframe.
clf_naive_bayes = GaussianNB()
clf_naive_bayes.fit(X_train_final, y_train)
# The Naive Bayes estimates of probability of the positive class (TARGET=1):
# the probability estimate of each borrower making at least one late loan payment.
naive_bayes_PCA_y_score = clf_naive_bayes.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
naive_bayes_PCA_roc_auc_score = roc_auc_scorer(y_test, naive_bayes_PCA_y_score)
# Add the Naive Bayes classifier's scores to the results list.
y_score_list.append(naive_bayes_PCA_y_score)
clf_label_list.append('Naive Bayes PCA')
print('Naive Bayes (PCA) test validation set predictions\' ROC AUC score: {}'.format(naive_bayes_PCA_roc_auc_score))
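Assuming roc_auc_scorer wraps sklearn's metric, the scoring reduces to sklearn.metrics.roc_auc_score; a minimal check on four hand-made predictions:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # 3 of the 4 positive/negative pairs ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```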
# 2. Use an AdaBoost classifier to make predictions on the test validation set.
# Calculate the area under ROC curve score of these predictions.
# Fit the AdaBoost classifier, using the parameter for 'n_estimators' discovered
# when running GridSearchCV on AdaBoost above for the full featureset.
clf_AdaBoost = AdaBoostClassifier(learning_rate=1.0, n_estimators=1000, random_state=42)
clf_AdaBoost.fit(X_train_final, y_train)
# The AdaBoost classifier's estimates of probability of the positive class (TARGET=1):
# the probability estimate of each borrower making at least one late loan payment.
adaBoost_PCA_y_score = clf_AdaBoost.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
adaBoost_PCA_roc_auc_score = roc_auc_scorer(y_test, adaBoost_PCA_y_score)
# Add the AdaBoost classifier's scores to the results list.
y_score_list.append(adaBoost_PCA_y_score)
clf_label_list.append('AdaBoost PCA')
print('AdaBoost (PCA) test validation set predictions\' ROC AUC score: {}'.format(adaBoost_PCA_roc_auc_score))
# 3. Try using a Logistic Regression classifier to make predictions.
# Fit the classifier to the training data.
clf_logistic_regression = LogisticRegression(penalty='l1', random_state=42, solver='liblinear', max_iter=100)
clf_logistic_regression.fit(X_train_final, y_train)
# The logistical regression classifier's estimates of probability of the positive
# class (TARGET=1): the probability estimate of each borrower making at least one
# late loan payment.
logistic_regression_PCA_y_score = clf_logistic_regression.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
logistic_regression_PCA_roc_auc_score = roc_auc_scorer(y_test, logistic_regression_PCA_y_score)
# Add the Logistic Regression classifier's scores to the results list.
y_score_list.append(logistic_regression_PCA_y_score)
clf_label_list.append('Logistic Regression PCA')
print('Logistic Regression (PCA) test validation set predictions\' ROC AUC score: {}'.format(logistic_regression_PCA_roc_auc_score))
# 4. Try using a Multi-layer Perceptron classifier to make predictions.
# Fit the classifier to the training data.
clf_mlp = MLPClassifier(
hidden_layer_sizes=100, activation='identity', solver='adam', alpha=0.001, batch_size=200,
learning_rate_init=0.001, random_state=42, warm_start=False
)
clf_mlp.fit(X_train_final, y_train)
# The multi-layer perceptron classifier's estimates of probability of the positive
# class (TARGET=1): the probability estimate of each borrower making at least one
# late loan payment.
mlp_PCA_y_score = clf_mlp.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
mlp_PCA_roc_auc_score = roc_auc_scorer(y_test, mlp_PCA_y_score)
# Add the Multi-Layer Perceptron classifier's scores to the results list.
y_score_list.append(mlp_PCA_y_score)
clf_label_list.append('Multi-Layer Perceptron PCA')
print('Multi-layer Perceptron (PCA) test validation set predictions\' ROC AUC score: {}'.format(mlp_PCA_roc_auc_score))
# 5. Try using a LightGBM classifier.
# Convert preprocessed training dataset into LightGBM dataset format
lightgbm_training = lgb.Dataset(X_train_final, label=y_train)
# Specify parameters
params = {}
params['learning_rate'] = 0.01
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'auc'
params['sub_feature'] = 0.3
params['num_leaves'] = 100
params['min_data_in_leaf'] = 500
params['max_depth'] = 10
params['max_bin'] = 64
#params['min_data_in_bin'] = 3
#params['lambda_l1'] = 0.01
params['lambda_l2'] = 0.01
#params['min_gain_to_split'] = 0.01
params['bagging_freq'] = 100
params['bagging_fraction'] = 0.9
#params['feature_fraction'] = 0.5
# Fit the LightGBM classifier to the training data
clf_lgb = lgb.train(params, lightgbm_training, 1500)
# Classifier's estimates of probability of the positive class (TARGET=1): the
# probability estimate of each borrower making at least one late loan payment.
lgb_PCA_y_score = clf_lgb.predict(X_test_final)
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
lgb_PCA_roc_auc_score = roc_auc_scorer(y_test, lgb_PCA_y_score)
# Add the LightGBM classifier's scores to the results list.
y_score_list.append(lgb_PCA_y_score)
clf_label_list.append('LightGBM PCA')
print('LightGBM (PCA) test validation set predictions\' ROC AUC score: {}'.format(lgb_PCA_roc_auc_score))
# Step 2. Try training all classifiers on a featureset where SelectKBest feature selection
# has been used to narrow down the full featureset's 251 features to the best-performing features.
# Preprocess the dataset similarly to how it was done above. However, this time use SelectKBest
# to keep only a portion of the 251 features that exist after one-hot encoding.
# Load the main data tables
application_train_data = pd.read_csv("data/application_train.csv")
application_test_data = pd.read_csv("data/application_test.csv")
# Load the Bureau data table
bureau_data = pd.read_csv("data/bureau.csv")
# 1: Create lists of different feature types in the main data
# frame, based on how each type will need to be preprocessed.
# i. All 18 categorical features needing one-hot encoding.
# Includes the 4 categorical features originally
# mis-identified as having been normalized:
# EMERGENCYSTATE_MODE, HOUSETYPE_MODE, WALLSMATERIAL_MODE,
# FONDKAPREMONT_MODE
cat_feat_need_one_hot = [
'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
'NAME_TYPE_SUITE', 'OCCUPATION_TYPE', 'EMERGENCYSTATE_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'FONDKAPREMONT_MODE'
]
# ii. All 32 binary categorical features already one-hot encoded.
bin_cat_feat = [
'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'
]
# iii. All 2 non-normalized numerical features with skewed distributions
# and negative values. These features will need to have their
# distributions translated to positive ranges before being
# log-transformed, and then later scaled to the range [0,1].
non_norm_feat_neg_values_skewed = [
'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE'
]
# iv. All 15 non-normalized numerical features with skewed distributions,
# and only positive values. These features will need to be
# log-transformed, and eventually scaled to the range [0,1].
non_norm_feat_pos_values_skewed = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
'AMT_GOODS_PRICE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'OWN_CAR_AGE'
]
# v. All 4 numerical features with normal shapes but needing to be scaled
# to the range [0,1].
norm_feat_need_scaling = [
'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START',
'REGION_POPULATION_RELATIVE'
]
# vi. All 46 numerical features that have been normalized to the range
# [0,1]. These features will need neither log-transformation, nor
# any further scaling.
norm_feat_not_need_scaling = [
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_AVG',
'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'LIVINGAREA_AVG',
'LIVINGAREA_MODE', 'LIVINGAREA_MEDI', 'ENTRANCES_AVG',
'ENTRANCES_MODE', 'ENTRANCES_MEDI', 'APARTMENTS_AVG',
'APARTMENTS_MODE', 'APARTMENTS_MEDI', 'ELEVATORS_AVG',
'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'NONLIVINGAREA_AVG',
'NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI', 'EXT_SOURCE_1',
'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI',
'LANDAREA_AVG', 'LANDAREA_MODE', 'LANDAREA_MEDI',
'YEARS_BUILD_AVG', 'YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI',
'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI',
'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_MEDI',
'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MEDI',
'COMMONAREA_AVG', 'COMMONAREA_MODE', 'COMMONAREA_MEDI',
'TOTALAREA_MODE'
]
# vii. The remaining 3 features in the main data frame that will be
# re-engineered and transformed into different features
feat_to_be_reengineered = [
'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'DAYS_EMPLOYED'
]
# 2: Separate target data from training dataset.
targets = application_train_data['TARGET']
features_raw = application_train_data.drop('TARGET', axis = 1)
# 3: Use train_test_split from sklearn.model_selection to
# create a test validation set that is 20% of the size of the total training set.
# This will allow me to compare the performance of various learning algorithms
# without overfitting to the training data.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(features_raw,
targets,
test_size = 0.2,
random_state = 42)
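Because TARGET=1 borrowers are a small minority of the training set, it may also be worth stratifying the split so both sets keep the same class ratio; train_test_split supports this via its stratify argument (illustrated on toy data):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [1 if i < 10 else 0 for i in range(100)]  # 10% positive class
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(sum(y_te), len(y_te))  # 2 20 -- the 10% positive rate is preserved
```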
# 4: Use the CNT_CHILDREN feature to engineer a binary
# categorical feature called HAS_CHILDREN. If value of CNT_CHILDREN is
# greater than 0, the value of HAS_CHILDREN will be 1. If value of CNT_CHILDREN is
# 0, value of HAS_CHILDREN will be 0.
CNT_CHILDREN_train = X_train_raw['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_train.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# 5: Drop the CNT_CHILDREN column from the main dataframe
X_train_raw = X_train_raw.drop('CNT_CHILDREN', axis=1)
# Add the new HAS_CHILDREN feature to the list of binary categorical
# features that are already one-hot encoded. There are now 33 such features.
bin_cat_feat = bin_cat_feat + ['HAS_CHILDREN']
# 6. Use the CNT_FAM_MEMBERS feature to engineer a categorical feature called NUMBER_FAMILY_MEMBERS.
# If CNT_FAM_MEMBERS is 1.0, then the value of NUMBER_FAMILY_MEMBERS will be 'one'. If CNT_FAM_MEMBERS is 2.0,
# then NUMBER_FAMILY_MEMBERS will be 'two'. If CNT_FAM_MEMBERS is 3.0 or greater, then NUMBER_FAMILY_MEMBERS will
# be 'three_plus'.
CNT_FAM_MEMBERS_train = X_train_raw['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_train.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
X_train_raw = X_train_raw.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# 7. Drop the CNT_FAM_MEMBERS feature from the main dataframe
X_train_raw = X_train_raw.drop('CNT_FAM_MEMBERS', axis=1)
# Add the new NUMBER_FAMILY_MEMBERS feature to the list of categorical
# features that will need to be one-hot encoded. There are now 19 of these features.
cat_feat_need_one_hot = cat_feat_need_one_hot + ['NUMBER_FAMILY_MEMBERS']
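One quirk of the nested-conditional mapping above: any NaN in CNT_FAM_MEMBERS compares unequal to both 1 and 2, so it silently lands in the 'three_plus' bucket:

```python
import pandas as pd

cnt_fam = pd.Series([1.0, 2.0, 5.0, float('nan')])
buckets = cnt_fam.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
print(buckets.tolist())  # ['one', 'two', 'three_plus', 'three_plus']
```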
# 8. Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
# categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
# particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
# HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
# borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
# Filter the bureau data table for loans which are overdue (have a value
# for CREDIT_DAY_OVERDUE that's greater than 0)
bureau_data_filtered_for_overdue = bureau_data[bureau_data['CREDIT_DAY_OVERDUE'] > 0]
def build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(dataframe):
"""
Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
Parameters:
dataframe: Pandas dataframe containing a training or testing dataset
Returns: The dataframe with HAS_CREDIT_BUREAU_LOANS_OVERDUE feature appended to it.
"""
# Create a series called HAS_CREDIT_BUREAU_LOANS_OVERDUE and fill it with zeros.
# Its index is identical to that of the main dataframe. It will eventually be appended
# to the main data frame as a column.
HAS_CREDIT_BUREAU_LOANS_OVERDUE = pd.Series(data=0, index = dataframe['SK_ID_CURR'].index)
# A list of all the borrowers IDs in the main dataframe
main_data_table_borrower_IDs = dataframe['SK_ID_CURR'].values
# For each loan in the bureau data table that is overdue
# (has a value for CREDIT_DAY_OVERDUE that's greater than 0)
for index, row in bureau_data_filtered_for_overdue.iterrows():
# The borrower ID (SK_ID_CURR) that owns the overdue loan
borrower_ID = row['SK_ID_CURR']
# If the borrower ID owning the overdue loan is also
# in the main data frame, then enter a value of 1 in
# the series HAS_CREDIT_BUREAU_LOANS_OVERDUE at an index
# that is identical to the index of the borrower ID
# in the main data frame.
if borrower_ID in main_data_table_borrower_IDs:
# The index of the borrower's row in the main data table.
borrower_index_main_data_table = dataframe.index[dataframe['SK_ID_CURR'] == borrower_ID].tolist()[0]
# Place a value of 1 at the index of the series HAS_CREDIT_BUREAU_LOANS_OVERDUE
# which corresponds to the index of the borrower's ID in the main data table.
HAS_CREDIT_BUREAU_LOANS_OVERDUE.loc[borrower_index_main_data_table] = 1
# Append the newly engineered HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the main dataframe.
dataframe = dataframe.assign(HAS_CREDIT_BUREAU_LOANS_OVERDUE=HAS_CREDIT_BUREAU_LOANS_OVERDUE.values)
return dataframe
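The row-by-row loop above is correct but slow on hundreds of thousands of rows; an equivalent vectorized version using Series.isin is sketched below on toy IDs:

```python
import pandas as pd

main = pd.DataFrame({'SK_ID_CURR': [100001, 100005, 100013]})
bureau = pd.DataFrame({'SK_ID_CURR': [100005, 100005, 999999],
                       'CREDIT_DAY_OVERDUE': [10, 0, 3]})
# IDs with at least one overdue credit-bureau loan
overdue_ids = bureau.loc[bureau['CREDIT_DAY_OVERDUE'] > 0, 'SK_ID_CURR']
main['HAS_CREDIT_BUREAU_LOANS_OVERDUE'] = (
    main['SK_ID_CURR'].isin(overdue_ids).astype(int))
print(main['HAS_CREDIT_BUREAU_LOANS_OVERDUE'].tolist())  # [0, 1, 0]
```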
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
X_train_raw = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(X_train_raw)
# Add the new HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the list of binary categorical
# features. There are now 34 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_CREDIT_BUREAU_LOANS_OVERDUE']
# 9. Use the DAYS_EMPLOYED feature to engineer a binary categorical feature called HAS_JOB.
# If the value of DAYS_EMPLOYED is 0 or less, then HAS_JOB will be 1. Otherwise, HAS_JOB will
# be 0. The 'otherwise' case covers every borrower with the sentinel value 365243 for
# DAYS_EMPLOYED, which I hypothesized is best interpreted as the borrower not having a job.
DAYS_EMPLOYED_train = X_train_raw['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_train.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_JOB=HAS_JOB.values)
# 10. Drop the DAYS_EMPLOYED feature from the main dataframe
X_train_raw = X_train_raw.drop('DAYS_EMPLOYED', axis=1)
# Add the new HAS_JOB feature to the list of binary categorical features.
# There are now 35 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_JOB']
# 11. Translate the 2 non-normalized numerical features that have skewed distributions
# and negative values: DAYS_REGISTRATION, and DAYS_LAST_PHONE_CHANGE
def translate_negative_valued_features(dataframe, feature_name_list):
"""
Translate a dataset's continuous features containing several negative
values. The dataframe is modified in place so that all values of each feature
listed in the feature_name_list parameter become non-negative.
Parameters:
dataframe: Pandas dataframe containing the features
feature_name_list: List of strings, containing the names
of each feature whose values will be
translated
"""
for feature in feature_name_list:
# The minimum, most-negative, value of the feature
feature_min_value = dataframe[feature].min()
# Translate each value of the feature in a positive direction,
# of magnitude that's equal to the feature's most negative value.
dataframe[feature] = dataframe[feature].apply(lambda x: x - feature_min_value)
# Translate the above two negatively-valued features to positive values
translate_negative_valued_features(X_train_raw, non_norm_feat_neg_values_skewed)
# 12. Log-transform all 17 non-normalized numerical features that have skewed distributions.
# These 17 features include the 2 that were translated to positive ranges in Step 11.
# Add the 2 features translated to positive ranges above in Step 11 to
# the list of non-normalized skewed features with positive values. This is
# the set of features that will be log-transformed
log_transform_feats = non_norm_feat_pos_values_skewed + non_norm_feat_neg_values_skewed
X_train_raw[log_transform_feats] = X_train_raw[log_transform_feats].apply(lambda x: np.log(x + 1))
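As a sanity check on Step 12's choice of transform, the sketch below shows how `np.log1p` (the same operation as `np.log(x + 1)`) pulls in the long right tail of a skewed, non-negative feature. The `sample` data is synthetic, standing in for a feature like AMT_INCOME_TOTAL, not drawn from the competition tables:

```python
import numpy as np

rng = np.random.default_rng(42)
# A heavily right-skewed, non-negative synthetic sample (illustrative only).
sample = rng.lognormal(mean=10, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (x.std() ** 3)

raw_skew = skewness(sample)
log_skew = skewness(np.log1p(sample))  # same transform as np.log(x + 1)
# The log1p-transformed distribution is far closer to symmetric.
print(raw_skew, log_skew)
```

Near-symmetric distributions tend to play better with the mean imputation and min-max scaling applied in the later steps.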
# 13. Replace 'NaN' values for all numerical features with each feature's mean. Fit an imputer
# to each numerical feature containing at least one 'NaN' entry.
# Create a list of all the 67 numerical features in the main dataframe. These include all
# 17 features that were log-transformed in Step 12, as well as the 4 normal features that
# still need to be scaled, as well as the 46 normal features that don't need scaling.
numerical_features = log_transform_feats + norm_feat_need_scaling + norm_feat_not_need_scaling
# Create a list of all numerical features in the training set that have at least one 'NaN' entry
numerical_features_with_nan = X_train_raw[numerical_features].columns[X_train_raw[numerical_features].isna().any()].tolist()
# Create an imputer
imputer = Imputer()
# Fit the imputer to each numerical feature in the training set that has 'NaN' values,
# and replace each 'NaN' entry of each feature with that feature's mean.
X_train_raw[numerical_features_with_nan] = imputer.fit_transform(X_train_raw[numerical_features_with_nan])
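The imputer is fit with `fit_transform` here so the learned feature means can be reused on the test sets later. A minimal sketch of that fit-on-train/transform-on-test pattern, using the modern `SimpleImputer` (the older `Imputer` class used in this notebook was removed from scikit-learn) and a hypothetical `AMT_ANNUITY` column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames sharing one numerical column.
train = pd.DataFrame({'AMT_ANNUITY': [10.0, np.nan, 30.0, 20.0]})
test = pd.DataFrame({'AMT_ANNUITY': [np.nan, 50.0]})

imputer = SimpleImputer(strategy='mean')
# Fit on the training data only, so test-set NaNs are filled with the
# *training* mean (20.0 here), avoiding leakage of test statistics.
train[['AMT_ANNUITY']] = imputer.fit_transform(train[['AMT_ANNUITY']])
test[['AMT_ANNUITY']] = imputer.transform(test[['AMT_ANNUITY']])
print(test['AMT_ANNUITY'].tolist())  # [20.0, 50.0]
```

The key point is that `transform` (not `fit_transform`) is what should run inside any test-set pipeline.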
# 14. Remove the borrower ID column, SK_ID_CURR, from the main dataframe
X_train_raw = X_train_raw.drop('SK_ID_CURR', axis=1)
# 15. One-hot encode all 19 non-binary categorical features.
X_train_raw = pd.get_dummies(X_train_raw, columns=cat_feat_need_one_hot)
# Create a list that includes only the newly one-hot encoded features
# as well as all the categorical features that were already binary.
all_bin_cat_feat = X_train_raw.columns.tolist()
for column_name in X_train_raw[numerical_features].columns.tolist():
all_bin_cat_feat.remove(column_name)
# 16. Replace all 'NaN' values in all binary categorical features with 0.
# Create a list of binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan = X_train_raw[all_bin_cat_feat].columns[X_train_raw[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
X_train_raw[bin_cat_feat_with_nan] = X_train_raw[bin_cat_feat_with_nan].fillna(value=0)
# 17. Fit a min-max scaler to each of the 17 log-transformed numerical features, as well
# as to the 4 features DAYS_BIRTH, DAYS_ID_PUBLISH, HOUR_APPR_PROCESS_START, and the normalized
# feature REGION_POPULATION_RELATIVE. Each feature will be scaled to a range [0.0, 1.0].
# Build a list of all 21 features needing scaling. Add the list of features that
# were log-normalized above in Step 12 to the list of normally shaped features
# that need to be scaled to the range [0,1].
feats_to_scale = norm_feat_need_scaling + log_transform_feats
# Initialize a scaler with the default range of [0,1]
scaler = MinMaxScaler()
# Fit the scaler to each of the features of the train set that need to be scaled,
# then transform each of these features' values to the new scale.
X_train_raw[feats_to_scale] = scaler.fit_transform(X_train_raw[feats_to_scale])
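One caveat with reusing a fitted `MinMaxScaler` on test data: a value outside the training range maps outside [0, 1]. A small sketch on synthetic single-column data showing the effect, with an optional clip if a strict [0, 1] range matters downstream:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_col = np.array([[0.0], [5.0], [10.0]])
scaler.fit(train_col)

# A test value above the training maximum maps above 1.
test_col = np.array([[12.0]])
scaled = scaler.transform(test_col)  # ~1.2, outside [0, 1]

# Clip after transforming if downstream code assumes [0, 1].
clipped = np.clip(scaled, 0.0, 1.0)
print(scaled, clipped)
```

This is rarely a problem for tree-based models, but can matter for models sensitive to input range.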
# Rename the dataframe to indicate that its columns have been fully preprocessed.
X_train_processed = X_train_raw
# 18. Fit selectKBest to the fully processed full feature set.
selectK = SelectKBest(score_func=f_classif, k=10)
selectK.fit(X_train_processed, y_train)
# Rank each feature by its score in SelectKBest and display
feature_list = X_train_processed.columns.values
feature_importance_list = selectK.scores_
rows_list = []
for i in range(len(feature_importance_list)):
dictionary = {}
dictionary['Feature Name'] = feature_list[i]
dictionary['Score'] = feature_importance_list[i]
rows_list.append(dictionary)
selectKBest_feature_scores = pd.DataFrame(rows_list, columns=['Feature Name', 'Score'])
selectKBest_feature_scores_ranked = selectKBest_feature_scores.sort_values('Score',ascending=False)
# Ranked scores of each feature using SelectKBest with a
# f_classif scorer.
display(selectKBest_feature_scores_ranked)
# Select the top k=30 features according to their f_classif scores.
selectK_top_features = selectKBest_feature_scores_ranked['Feature Name'].values[:30]
print(selectK_top_features)
# Determine what number of features have an aggregate f_classif score that
# comprises 90% of the aggregate f_classif of all features.
aggregate_f_classif_score_all_features = selectKBest_feature_scores_ranked['Score'].sum()
aggregate_f_classif_score_top_30_features = selectKBest_feature_scores_ranked[:30]['Score'].sum()
print('Aggregate f_classif score of all 251 features: {}'.format(aggregate_f_classif_score_all_features))
print('Aggregate f_classif score of top 30 features: {}'.format(aggregate_f_classif_score_top_30_features))
print('Top 30 features\' total score is {}% of the total score of all 251 features.'.format(round(aggregate_f_classif_score_top_30_features*100./aggregate_f_classif_score_all_features,2)))
# Reduce the training dataset to the top 30 features:
X_train_final = X_train_processed[selectK_top_features]
display(X_train_final)
# 21. Build a data preprocessing pipeline to be used for all testing sets.
# This pipeline will recreate all features that were engineered in the
# training set during the original data preprocessing phase.
# The pipeline will also apply the imputer, min-max scaler, and SelectKBest
# feature selection originally fit on the training set to all datapoints
# in a testing set.
def adjust_columns_application_test_csv_table(testing_dataframe):
"""
After it is one-hot encoded, application_test.csv data table will have one
extra column, 'REGION_RATING_CLIENT_W_CITY_-1', that is not present in the
training dataframe. This column will be removed from the testing datatable
in this case. Only 1 of the 48,744 rows in application_test.csv will have a
value of 1 for this feature following one-hot encoding. I am not worried
about this column's elimination from the testing dataframe affecting predictions.
Additionally, unlike the test validation set, which originally comprised 20% of
application_train.csv, application_test.csv will be missing the following columns
after it is one-hot encoded:
'CODE_GENDER_XNA', 'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown'
In this case, we need to insert these columns into the testing dataframe, at
the exact same indices they are located at in the fully preprocessed training
dataframe. Each inserted column will be filled with all zeros. (If one of these
binary features is missing from the application_test.csv data table, we can infer
that every borrower in that data table would have a 0 for that feature were it
present.)
Parameters:
testing_dataframe: Pandas dataframe containing the testing dataset
contained in the file application_test.csv
Returns: a testing dataframe containing the exact same columns and
column order as found in the training dataframe
"""
# Identify any columns in the one-hot encoded testing_dataframe that
# are not in X_train_raw. These columns will need to be removed from the
# testing_dataframe. (Expected that there will only be one such
# column: 'REGION_RATING_CLIENT_W_CITY_-1')
X_train_columns_list = X_train_raw.columns.tolist()
testing_dataframe_columns_list = testing_dataframe.columns.tolist()
for column_name in X_train_columns_list:
if column_name in testing_dataframe_columns_list:
testing_dataframe_columns_list.remove(column_name)
columns_not_in_X_train_raw = testing_dataframe_columns_list
# Drop any column from the testing_dataframe that is not in the
# training dataframe. Expected to only be the one column 'REGION_RATING_CLIENT_W_CITY_-1'
for column in columns_not_in_X_train_raw:
testing_dataframe = testing_dataframe.drop(column, axis=1)
# Get the column indices of each of the features 'CODE_GENDER_XNA',
# 'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown' from
# the raw training dataframe (X_train_raw) prior to SelectKBest feature selection.
loc_code_gender_training_frame = X_train_raw.columns.get_loc('CODE_GENDER_XNA')
loc_name_income_type_maternity_leave_training_frame = X_train_raw.columns.get_loc('NAME_INCOME_TYPE_Maternity leave')
loc_name_family_status_unknown_training_frame = X_train_raw.columns.get_loc('NAME_FAMILY_STATUS_Unknown')
# Insert each column into the testing dataframe at the same index it had
# in the X_train_raw dataframe before feature selection. Fill each column with all 0s.
# Order is important. 'CODE_GENDER_XNA' should be inserted first, followed by
# 'NAME_INCOME_TYPE_Maternity leave', and then finally 'NAME_FAMILY_STATUS_Unknown'.
testing_dataframe.insert(loc=loc_code_gender_training_frame, column='CODE_GENDER_XNA', value=0)
testing_dataframe.insert(loc=loc_name_income_type_maternity_leave_training_frame, column='NAME_INCOME_TYPE_Maternity leave', value=0)
testing_dataframe.insert(loc=loc_name_family_status_unknown_training_frame, column='NAME_FAMILY_STATUS_Unknown', value=0)
return testing_dataframe
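For reference, pandas can do most of this column alignment in one call: `reindex` drops extra columns, inserts missing ones filled with a constant, and enforces the training column order all at once. A sketch using tiny hypothetical one-hot frames (the column names `A`, `B`, `C_XNA`, `C_-1` are made up for illustration):

```python
import pandas as pd

# Hypothetical one-hot encoded frames: the test frame has an extra
# column ('C_-1') and is missing one ('C_XNA') relative to training.
train_df = pd.DataFrame({'A': [1, 0], 'C_XNA': [0, 1], 'B': [3, 4]})
test_df = pd.DataFrame({'A': [1], 'B': [9], 'C_-1': [1]})

# reindex drops columns absent from the training frame, inserts missing
# ones filled with 0, and enforces the training column order in one step.
aligned = test_df.reindex(columns=train_df.columns, fill_value=0)
print(aligned.columns.tolist())  # ['A', 'C_XNA', 'B']
```

The manual `insert`/`drop` approach above makes the same guarantees explicit, but `reindex` removes the need to track column indices by hand.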
def test_set_preprocessing_pipeline(testing_dataframe):
"""
Recreate all features that were engineered in the training set during
the original data preprocessing phase. Missing numerical 'NaN' values
will be filled with an imputer. Missing binary categorical feature 'NaN'
values will be replaced with 0. The pipeline will also apply the min-max
scaler and SelectKBest feature selection originally fit on the training
set to the features of the testing set.
Parameters:
testing_dataframe: Pandas dataframe containing a testing dataset
Returns: a fully preprocessed testing dataframe
"""
# Create the HAS_CHILDREN feature.
CNT_CHILDREN_test = testing_dataframe['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_test.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# Drop the CNT_CHILDREN column from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_CHILDREN', axis=1)
# Create the NUMBER_FAMILY_MEMBERS feature.
CNT_FAM_MEMBERS_test = testing_dataframe['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_test.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# Drop the CNT_FAM_MEMBERS feature from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_FAM_MEMBERS', axis=1)
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
testing_dataframe = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(testing_dataframe)
# Create the HAS_JOB feature
DAYS_EMPLOYED_test = testing_dataframe['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_test.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_JOB=HAS_JOB.values)
# Drop the DAYS_EMPLOYED feature from the main dataframe
testing_dataframe = testing_dataframe.drop('DAYS_EMPLOYED', axis=1)
# Translate the two negatively-valued features DAYS_REGISTRATION, and
# DAYS_LAST_PHONE_CHANGE to positive values
translate_negative_valued_features(testing_dataframe, non_norm_feat_neg_values_skewed)
# Log-transform all 17 non-normalized numerical features that have skewed distributions.
testing_dataframe[log_transform_feats] = testing_dataframe[log_transform_feats].apply(lambda x: np.log(x + 1))
# Use the imputer fit on the training set to replace 'NaN' values in the same
# numerical features it was fit on, filling each with that feature's *training*
# mean. (Re-fitting the imputer here would leak test-set statistics into the
# preprocessing, contrary to the docstring above.)
testing_dataframe[numerical_features_with_nan] = imputer.transform(testing_dataframe[numerical_features_with_nan])
# Remove the borrower ID column, SK_ID_CURR, from the main dataframe
testing_dataframe = testing_dataframe.drop('SK_ID_CURR', axis=1)
# One-hot encode all 19 non-binary categorical features.
testing_dataframe = pd.get_dummies(testing_dataframe, columns=cat_feat_need_one_hot)
# After one-hot encoding, the testing dataframe from application_test.csv will be
# missing 2 columns that are in the training dataframe. It will also have an extra
# column that was not in the training dataframe, giving it 249 total columns.
# If this is the case, we need to modify this testing dataframe so that its columns
# and column order is consistent with the training dataframe.
if testing_dataframe.shape[1] == 249:
testing_dataframe = adjust_columns_application_test_csv_table(testing_dataframe)
# Create a list of the binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan_testing = testing_dataframe[all_bin_cat_feat].columns[testing_dataframe[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
testing_dataframe[bin_cat_feat_with_nan_testing] = testing_dataframe[bin_cat_feat_with_nan_testing].fillna(value=0)
# Transform each of the 21 features that need to be scaled to the range [0,1] using
# the min-max scaler fit on the training set.
testing_dataframe[feats_to_scale] = scaler.transform(testing_dataframe[feats_to_scale])
# Reduce the testing dataframe to the top selectKbest features:
testing_dataframe = testing_dataframe[selectK_top_features]
return testing_dataframe
# 22. Preprocess the test validation set.
X_test_final = test_set_preprocessing_pipeline(X_test_raw)
# Verify that both the training and test validation dataframes have the expected number of columns
# after preprocessing their data and reducing their feature spaces to the top 30 features returned
# by SelectKBest.
print('Training set preprocessing complete. The final training dataframe now has {} columns. Expected: 30.'.format(X_train_final.shape[1]))
print('Test validation set preprocessing complete. The final test validation dataframe now has {} columns. Expected: 30.'.format(X_test_final.shape[1]))
# Train the classifiers and compute prediction probabilities.
# 1. Use a Gaussian Naive Bayes classifier to make predictions on
# the test validation set. Calculate the area under ROC curve score of
# these predictions.
# Fit a Gaussian Naive Bayes classifier to the training dataframe.
clf_naive_bayes = GaussianNB()
clf_naive_bayes.fit(X_train_final, y_train)
# The Naive Bayes estimates of probability of the positive class (TARGET=1):
# the probability estimate of each borrower making at least one late loan payment.
naive_bayes_selectKbest_y_score = clf_naive_bayes.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
naive_bayes_selectKbest_roc_auc_score = roc_auc_scorer(y_test, naive_bayes_selectKbest_y_score)
# Add the Naive Bayes classifier's scores to the results list.
y_score_list.append(naive_bayes_selectKbest_y_score)
clf_label_list.append('Naive Bayes SelectKBest, K=30')
print('Naive Bayes (SelectKBest) test validation set predictions\' ROC AUC score: {}'.format(naive_bayes_selectKbest_roc_auc_score))
# 2. Use an AdaBoost classifier to make predictions on the test validation set.
# Calculate the area under ROC curve score of these predictions.
# Fit the AdaBoost classifier, using the parameter for 'n_estimators' discovered
# when running GridSearchCV on AdaBoost above for the full featureset.
clf_AdaBoost = AdaBoostClassifier(learning_rate=1.0, n_estimators=1000, random_state=42)
clf_AdaBoost.fit(X_train_final, y_train)
# The AdaBoost classifier's estimates of probability of the positive class (TARGET=1):
# the probability estimate of each borrower making at least one late loan payment.
adaBoost_selectKbest_y_score = clf_AdaBoost.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
adaBoost_selectKbest_roc_auc_score = roc_auc_scorer(y_test, adaBoost_selectKbest_y_score)
# Add the AdaBoost classifier's scores to the results list.
y_score_list.append(adaBoost_selectKbest_y_score)
clf_label_list.append('AdaBoost SelectKBest, K=30')
print('AdaBoost (SelectKBest) test validation set predictions\' ROC AUC score: {}'.format(adaBoost_selectKbest_roc_auc_score))
# 3. Try using a Logistic Regression classifier to make predictions.
# Fit the classifier to the training data.
clf_logistic_regression = LogisticRegression(penalty='l2', random_state=42, solver='liblinear', max_iter=100)
clf_logistic_regression.fit(X_train_final, y_train)
# The logistical regression classifier's estimates of probability of the positive
# class (TARGET=1): the probability estimate of each borrower making at least one
# late loan payment.
logistic_regression_selectKbest_y_score = clf_logistic_regression.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
logistic_regression_selectKbest_roc_auc_score = roc_auc_scorer(y_test, logistic_regression_selectKbest_y_score)
# Add the Logistic Regression classifier's scores to the results list.
y_score_list.append(logistic_regression_selectKbest_y_score)
clf_label_list.append('Logistic Regression SelectKBest, K=30')
print('Logistic Regression (SelectKBest) test validation set predictions\' ROC AUC score: {}'.format(logistic_regression_selectKbest_roc_auc_score))
# 4. Try using a Multi-layer Perceptron classifier to make predictions.
# Fit the classifier to the training data.
clf_mlp = MLPClassifier(
    hidden_layer_sizes=(100,), activation='identity', solver='adam', alpha=0.001, batch_size=200,
    learning_rate_init=0.001, random_state=42
)
clf_mlp.fit(X_train_final, y_train)
# The multi-layer perceptron classifier's estimates of probability of the positive
# class (TARGET=1): the probability estimate of each borrower making at least one
# late loan payment.
mlp_selectKbest_y_score = clf_mlp.predict_proba(X_test_final)[:, 1]
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
mlp_selectKbest_roc_auc_score = roc_auc_scorer(y_test, mlp_selectKbest_y_score)
# Add the Multi-Layer Perceptron classifier's scores to the results list.
y_score_list.append(mlp_selectKbest_y_score)
clf_label_list.append('Multi-Layer Perceptron SelectKBest, K=30')
print('Multi-layer Perceptron (SelectKBest) test validation set predictions\' ROC AUC score: {}'.format(mlp_selectKbest_roc_auc_score))
# 5. Try using a LightGBM classifier.
# Convert preprocessed training dataset into LightGBM dataset format
lightgbm_training = lgb.Dataset(X_train_final, label=y_train)
# Specify parameters
params = {
    'learning_rate': 0.01,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'sub_feature': 0.3,
    'num_leaves': 100,
    'min_data_in_leaf': 500,
    'max_depth': 10,
    'max_bin': 64,
    'lambda_l2': 0.01,
    'bagging_freq': 100,
    'bagging_fraction': 0.9,
    # Parameters experimented with but currently disabled:
    # 'min_data_in_bin': 3,
    # 'lambda_l1': 0.01,
    # 'min_gain_to_split': 0.01,
    # 'feature_fraction': 0.5,
}
# Fit the LightGBM classifier to the training data
clf_lgb = lgb.train(params, lightgbm_training, 1500)
# Classifier's estimates of probability of the positive class (TARGET=1): the
# probability estimate of each borrower making at least one late loan payment.
lgb_selectKbest_y_score = clf_lgb.predict(X_test_final)
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
lgb_selectKbest_roc_auc_score = roc_auc_scorer(y_test, lgb_selectKbest_y_score)
# Add the LightGBM classifier's scores to the results list.
y_score_list.append(lgb_selectKbest_y_score)
clf_label_list.append('LightGBM SelectKBest, K=30')
print('LightGBM (SelectKBest) test validation set predictions\' ROC AUC score: {}'.format(lgb_selectKbest_roc_auc_score))
# Step 3. Train a LightGBM classifier on the full featureset, and use GridSearchCV
# to develop further intuition on LightGBM parameter tuning.
# Load the main data tables
application_train_data = pd.read_csv("data/application_train.csv")
application_test_data = pd.read_csv("data/application_test.csv")
# Load the Bureau data table
bureau_data = pd.read_csv("data/bureau.csv")
# 1: Create lists of different feature types in the main data
# frame, based on how each type will need to be preprocessed.
# i. All 18 categorical features needing one-hot encoding.
# Includes the 4 categorical features originally
# mis-identified as having been normalized:
# EMERGENCYSTATE_MODE, HOUSETYPE_MODE, WALLSMATERIAL_MODE,
# FONDKAPREMONT_MODE
cat_feat_need_one_hot = [
'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
'NAME_TYPE_SUITE', 'OCCUPATION_TYPE', 'EMERGENCYSTATE_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'FONDKAPREMONT_MODE'
]
# ii. All 32 binary categorical features already one-hot encoded.
bin_cat_feat = [
'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'
]
# iii. All 2 non-normalized numerical features with skewed distributions
# and negative values. These features will need to have their
# distributions translated to positive ranges before being
# log-transformed, and then later scaled to the range [0,1].
non_norm_feat_neg_values_skewed = [
'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE'
]
# iv. All 15 non-normalized numerical features with skewed distributions,
# and only positive values. These features will need to be
# log-transformed, and eventually scaled to the range [0,1].
non_norm_feat_pos_values_skewed = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
'AMT_GOODS_PRICE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'OWN_CAR_AGE'
]
# v. All 4 numerical features with normal shapes but needing to be scaled
# to the range [0,1].
norm_feat_need_scaling = [
'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START',
'REGION_POPULATION_RELATIVE'
]
# vi. All 46 numerical features that have been normalized to the range
# [0,1]. These features will need neither log-transformation, nor
# any further scaling.
norm_feat_not_need_scaling = [
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_AVG',
'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'LIVINGAREA_AVG',
'LIVINGAREA_MODE', 'LIVINGAREA_MEDI', 'ENTRANCES_AVG',
'ENTRANCES_MODE', 'ENTRANCES_MEDI', 'APARTMENTS_AVG',
'APARTMENTS_MODE', 'APARTMENTS_MEDI', 'ELEVATORS_AVG',
'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'NONLIVINGAREA_AVG',
'NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI', 'EXT_SOURCE_1',
'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI',
'LANDAREA_AVG', 'LANDAREA_MODE', 'LANDAREA_MEDI',
'YEARS_BUILD_AVG', 'YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI',
'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI',
'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_MEDI',
'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MEDI',
'COMMONAREA_AVG', 'COMMONAREA_MODE', 'COMMONAREA_MEDI',
'TOTALAREA_MODE'
]
# vii. The remaining 3 features in the main data frame that will be
# re-engineered and transformed into different features
feat_to_be_reengineered = [
'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'DAYS_EMPLOYED'
]
# 2: Separate target data from training dataset.
targets = application_train_data['TARGET']
features_raw = application_train_data.drop('TARGET', axis = 1)
# 3: Use sklearn's train_test_split to create a test validation set that is
# 20% of the size of the total training set. This will allow me to compare
# the performance of various learning algorithms without overfitting to the
# training data.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(features_raw,
targets,
test_size = 0.2,
random_state = 42)
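Since only a small fraction of borrowers have TARGET=1, it may also be worth passing `stratify=targets` so both splits keep the same positive rate; a plain random split can drift slightly. A sketch on synthetic labels with a similar imbalance (roughly 8% positives, a made-up stand-in for the real TARGET column):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 8% positives.
y = np.array([1] * 8 + [0] * 92)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the positive rate (nearly) equal in both splits,
# which matters when evaluating classifiers on a rare positive class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(y_tr.mean(), y_te.mean())
```

Without stratification, a 20% split of a rare class can land noticeably above or below the overall rate purely by chance.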
# 4: Use the CNT_CHILDREN feature to engineer a binary
# categorical feature called HAS_CHILDREN. If value of CNT_CHILDREN is
# greater than 0, the value of HAS_CHILDREN will be 1. If value of CNT_CHILDREN is
# 0, value of HAS_CHILDREN will be 0.
CNT_CHILDREN_train = X_train_raw['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_train.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# 5: Drop the CNT_CHILDREN column from the main dataframe
X_train_raw = X_train_raw.drop('CNT_CHILDREN', axis=1)
# Add the new HAS_CHILDREN feature to the list of binary categorical
# features that are already one-hot encoded. There are now 33 such features.
bin_cat_feat = bin_cat_feat + ['HAS_CHILDREN']
# 6. Use the CNT_FAM_MEMBERS feature to engineer a categorical feature called NUMBER_FAMILY_MEMBERS.
# If CNT_FAM_MEMBERS is 1.0, then the value of NUMBER_FAMILY_MEMBERS will be 'one'. If CNT_FAM_MEMBERS is 2.0,
# then NUMBER_FAMILY_MEMBERS will be 'two'. If CNT_FAM_MEMBERS is 3.0 or greater, then NUMBER_FAMILY_MEMBERS will
# be 'three_plus'.
CNT_FAM_MEMBERS_train = X_train_raw['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_train.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
X_train_raw = X_train_raw.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# 7. Drop the CNT_FAM_MEMBERS feature from the main dataframe
X_train_raw = X_train_raw.drop('CNT_FAM_MEMBERS', axis=1)
# Add the new NUMBER_FAMILY_MEMBERS feature to the list of categorical
# features that will need to be one-hot encoded. There are now 19 of these features.
cat_feat_need_one_hot = cat_feat_need_one_hot + ['NUMBER_FAMILY_MEMBERS']
# 8. Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
# categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
# particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
# HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
# borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
# Filter the bureau data table for loans which are overdue (have a value
# for CREDIT_DAY_OVERDUE that's greater than 0)
bureau_data_filtered_for_overdue = bureau_data[bureau_data['CREDIT_DAY_OVERDUE'] > 0]
def build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(dataframe):
"""
Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
Parameters:
dataframe: Pandas dataframe containing a training or testing dataset
Returns: The dataframe with HAS_CREDIT_BUREAU_LOANS_OVERDUE feature appended to it.
"""
# Create a series called HAS_CREDIT_BUREAU_LOANS_OVERDUE and fill it with zeros.
# Its index is identical to that of the main dataframe. It will eventually be appended
# to the main data frame as a column.
HAS_CREDIT_BUREAU_LOANS_OVERDUE = pd.Series(data=0, index = dataframe['SK_ID_CURR'].index)
# A list of all the borrower IDs in the main dataframe
main_data_table_borrower_IDs = dataframe['SK_ID_CURR'].values
# For each loan in the bureau data table that is overdue
# (has a value for CREDIT_DAY_OVERDUE that's greater than 0)
for index, row in bureau_data_filtered_for_overdue.iterrows():
# The borrower ID (SK_ID_CURR) that owns the overdue loan
borrower_ID = row['SK_ID_CURR']
# If the borrower ID owning the overdue loan is also
# in the main data frame, then enter a value of 1 in
# the series HAS_CREDIT_BUREAU_LOANS_OVERDUE at an index
# that is identical to the index of the borrower ID
# in the main data frame.
if borrower_ID in main_data_table_borrower_IDs:
# The index of the borrower's row in the main data table.
borrower_index_main_data_table = dataframe.index[dataframe['SK_ID_CURR'] == borrower_ID].tolist()[0]
# Place a value of 1 at the index of the series HAS_CREDIT_BUREAU_LOANS_OVERDUE
# which corresponds to the index of the borrower's ID in the main data table.
HAS_CREDIT_BUREAU_LOANS_OVERDUE.loc[borrower_index_main_data_table] = 1
# Append the newly engineered HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the main dataframe.
dataframe = dataframe.assign(HAS_CREDIT_BUREAU_LOANS_OVERDUE=HAS_CREDIT_BUREAU_LOANS_OVERDUE.values)
return dataframe
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
X_train_raw = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(X_train_raw)
# Add the new HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the list of binary categorical
# features. There are now 34 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_CREDIT_BUREAU_LOANS_OVERDUE']
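The `iterrows` loop in `build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE` scans the filtered bureau table row by row and performs a linear `in` search for each row, which is slow on 300K+ borrowers. An equivalent vectorized sketch using `Series.isin`, shown on miniature made-up tables:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
main_df = pd.DataFrame({'SK_ID_CURR': [100001, 100002, 100003]})
bureau_df = pd.DataFrame({
    'SK_ID_CURR': [100001, 100003, 100003, 999999],
    'CREDIT_DAY_OVERDUE': [5, 0, 12, 3],
})

# IDs owning at least one overdue bureau loan.
overdue_ids = set(bureau_df.loc[bureau_df['CREDIT_DAY_OVERDUE'] > 0, 'SK_ID_CURR'])
# isin replaces the per-row loop with one vectorized membership test.
main_df['HAS_CREDIT_BUREAU_LOANS_OVERDUE'] = (
    main_df['SK_ID_CURR'].isin(overdue_ids).astype(int))
print(main_df['HAS_CREDIT_BUREAU_LOANS_OVERDUE'].tolist())  # [1, 0, 1]
```

The result matches the loop version: 1 for any SK_ID_CURR with an overdue bureau loan, 0 otherwise.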
# 9. Use the DAYS_EMPLOYED feature to engineer a binary categorical feature called HAS_JOB.
# If the value of DAYS_EMPLOYED is 0 or less, then HAS_JOB will be 1; otherwise, HAS_JOB will
# be 0. The HAS_JOB = 0 case covers every borrower with the sentinel value 365243 for
# DAYS_EMPLOYED, which I hypothesized is best interpreted as meaning that the borrower
# does not have a job.
DAYS_EMPLOYED_train = X_train_raw['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_train.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_JOB=HAS_JOB.values)
# 10. Drop the DAYS_EMPLOYED feature from the main dataframe
X_train_raw = X_train_raw.drop('DAYS_EMPLOYED', axis=1)
# Add the new HAS_JOB feature to the list of binary categorical features.
# There are now 35 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_JOB']
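A quick check of the HAS_JOB mapping on a few sample values, including the 365243 sentinel, confirms the intended behavior (the sample values are illustrative only):

```python
import pandas as pd

# Negative DAYS_EMPLOYED means currently employed; 365243 is the
# "no job" sentinel value in the raw data.
days_employed = pd.Series([-1200, -5, 0, 365243])
has_job = days_employed.map(lambda x: 1 if x <= 0 else 0)
print(has_job.tolist())  # [1, 1, 1, 0]
```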
# 11. Translate the 2 non-normalized numerical features that have skewed distributions
# and negative values: DAYS_REGISTRATION and DAYS_LAST_PHONE_CHANGE
def translate_negative_valued_features(dataframe, feature_name_list):
"""
Translate a dataset's continuous features containing several negative
values. The dataframe is modified such that all values of each feature
listed in the feature_name_list parameter become positive.
Parameters:
dataframe: Pandas dataframe containing the features
feature_name_list: List of strings, containing the names
of each feature whose values will be
translated
"""
for feature in feature_name_list:
# The minimum (most negative) value of the feature
feature_min_value = dataframe[feature].min()
# Shift every value of the feature in the positive direction by a
# magnitude equal to the feature's most negative value.
dataframe[feature] = dataframe[feature].apply(lambda x: x - feature_min_value)
# Translate the above two negatively-valued features to positive values
translate_negative_valued_features(X_train_raw, non_norm_feat_neg_values_skewed)
# 12. Log-transform all 17 non-normalized numerical features that have skewed distributions.
# These 17 features include the 2 that were translated to positive ranges in Step 11.
# Add the 2 features translated to positive ranges above in Step 11 to
# the list of non-normalized skewed features with positive values. This is
# the set of features that will be log-transformed
log_transform_feats = non_norm_feat_pos_values_skewed + non_norm_feat_neg_values_skewed
X_train_raw[log_transform_feats] = X_train_raw[log_transform_feats].apply(lambda x: np.log(x + 1))
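The `np.log(x + 1)` transform used above is the standard log1p transform. NumPy also provides `np.log1p`, which computes the same quantity with better numerical accuracy near zero; a small sketch on illustrative values:

```python
import numpy as np
import pandas as pd

values = pd.Series([0.0, 9.0, 99.0])
# The two forms are mathematically identical: log(x + 1) == log1p(x)
a = values.apply(lambda x: np.log(x + 1))
b = np.log1p(values)
print(np.allclose(a, b))  # True
```

Because the skewed features were translated so their minimum is 0, the argument to the logarithm is always at least 1, so the transform never produces NaN or -inf.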
# 13. Replace 'NaN' values for all numerical features with each feature's mean. Fit an imputer
# to each numerical feature containing at least one 'NaN' entry.
# Create a list of all the 67 numerical features in the main dataframe. These include all
# 17 features that were log-transformed in Step 12, as well as the 4 normal features that
# still need to be scaled, as well as the 46 normal features that don't need scaling.
numerical_features = log_transform_feats + norm_feat_need_scaling + norm_feat_not_need_scaling
# Create a list of all numerical features in the training set that have at least one 'NaN' entry
numerical_features_with_nan = X_train_raw[numerical_features].columns[X_train_raw[numerical_features].isna().any()].tolist()
# Create an imputer
imputer = Imputer()
# Fit the imputer to each numerical feature in the training set that has 'NaN' values,
# and replace each 'NaN' entry of each feature with that feature's mean.
X_train_raw[numerical_features_with_nan] = imputer.fit_transform(X_train_raw[numerical_features_with_nan])
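Because the imputer learns each feature's mean from the data it is fit on, it should be fit only on the training set and then reused via `transform` on any test set, so that test 'NaN's are filled with training-set means. A minimal sketch of that pattern, shown here with the modern `SimpleImputer` (older scikit-learn versions exposed the same behavior through the `Imputer` class used in this notebook):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imp = SimpleImputer(strategy='mean')
train_filled = imp.fit_transform(train)   # NaN -> train mean = 2.0
test_filled = imp.transform(test)         # test NaN also filled with 2.0
print(test_filled.ravel().tolist())  # [2.0, 10.0]
```

Calling `fit_transform` again on the test set would instead fill with test-set means, quietly leaking test statistics into the preprocessing.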
# 14. Remove the borrower ID column, SK_ID_CURR, from the main dataframe
X_train_raw = X_train_raw.drop('SK_ID_CURR', axis=1)
# 15. One-hot encode all 19 non-binary categorical features.
X_train_raw = pd.get_dummies(X_train_raw, columns=cat_feat_need_one_hot)
# Create a list that includes only the newly one-hot encoded features
# as well as all the categorical features that were already binary.
all_bin_cat_feat = X_train_raw.columns.tolist()
for column_name in X_train_raw[numerical_features].columns.tolist():
all_bin_cat_feat.remove(column_name)
# 16. Replace all 'NaN' values in all binary categorical features with 0.
# Create a list of binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan = X_train_raw[all_bin_cat_feat].columns[X_train_raw[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
X_train_raw[bin_cat_feat_with_nan] = X_train_raw[bin_cat_feat_with_nan].fillna(value=0)
# 17. Fit a min-max scaler to each of the 17 log-transformed numerical features, as well
# as to the 4 features DAYS_BIRTH, DAYS_ID_PUBLISH, HOUR_APPR_PROCESS_START, and the normalized
# feature REGION_POPULATION_RELATIVE. Each feature will be scaled to a range [0.0, 1.0].
# Build a list of all 21 features needing scaling. Add the list of features that
# were log-normalized above in Step 12 to the list of normally shaped features
# that need to be scaled to the range [0,1].
feats_to_scale = norm_feat_need_scaling + log_transform_feats
# Initialize a scaler with the default range of [0,1]
scaler = MinMaxScaler()
# Fit the scaler to each of the features of the train set that need to be scaled,
# then transform each of these features' values to the new scale.
X_train_raw[feats_to_scale] = scaler.fit_transform(X_train_raw[feats_to_scale])
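Note that a min-max scaler fit on the training set maps the training minimum and maximum to 0 and 1; test values outside the training range will land outside [0, 1] after `transform`. A quick illustration on made-up values (using a separate `scaler_demo` object so as not to shadow the notebook's `scaler`):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler_demo = MinMaxScaler()
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [12.0]])
scaler_demo.fit(train)
# 12.0 exceeds the training maximum of 10.0, so it scales past 1.0
print(scaler_demo.transform(test).ravel().tolist())  # [0.5, 1.2]
```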
# Rename the dataframe to indicate that its columns have been fully preprocessed.
X_train_final = X_train_raw
# 18. Build a data preprocessing pipeline to be used for all testing sets.
# This pipeline will recreate all features that were engineered in the
# training set during the original data preprocessing phase.
# The pipeline will also apply the imputer and min-max scaler
# originally fit on features in the training set to all datapoints in a
# testing set.
def adjust_columns_application_test_csv_table(testing_dataframe):
"""
After it is one-hot encoded, the application_test.csv data table will have one
extra column, 'REGION_RATING_CLIENT_W_CITY_-1', that is not present in the
training dataframe. This column will be removed from the testing data table
in this case. Only 1 of the 48,744 rows in application_test.csv will have a
value of 1 for this feature following one-hot encoding, so I am not worried
about this column's elimination from the testing dataframe affecting predictions.
Additionally, unlike the test validation set, which originally comprised 20% of
application_train.csv, application_test.csv will be missing the following columns
after it is one-hot encoded:
'CODE_GENDER_XNA', 'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown'
In this case, we need to insert these columns into the testing dataframe, at
the exact same indices they occupy in the fully preprocessed training
dataframe. Each inserted column will be filled with zeros. (Because these
binary features are missing from the application_test.csv data table, we can
infer that each borrower in that data table would have a 0 for each feature
were it present.)
Parameters:
testing_dataframe: Pandas dataframe containing the testing dataset
contained in the file application_test.csv
Returns: a testing dataframe containing the exact same columns and
column order as found in the training dataframe
"""
# Identify any columns in the one-hot encoded testing_dataframe that
# are not in X_train_raw. These columns will need to be removed from the
# testing_dataframe. (Expected that there will only be one such
# column: 'REGION_RATING_CLIENT_W_CITY_-1')
X_train_columns_list = X_train_raw.columns.tolist()
testing_dataframe_columns_list = testing_dataframe.columns.tolist()
for column_name in X_train_columns_list:
if column_name in testing_dataframe_columns_list:
testing_dataframe_columns_list.remove(column_name)
columns_not_in_X_train_raw = testing_dataframe_columns_list
# Drop any column from the testing_dataframe that is not in the
# training dataframe. Expected to only be the one column 'REGION_RATING_CLIENT_W_CITY_-1'
for column in columns_not_in_X_train_raw:
testing_dataframe = testing_dataframe.drop(column, axis=1)
# Get the column indices of each of the features 'CODE_GENDER_XNA',
# 'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown' from
# the raw training dataframe (X_train_raw) prior to having PCA run on it.
loc_code_gender_training_frame = X_train_raw.columns.get_loc('CODE_GENDER_XNA')
loc_name_income_type_maternity_leave_training_frame = X_train_raw.columns.get_loc('NAME_INCOME_TYPE_Maternity leave')
loc_name_family_status_unknown_training_frame = X_train_raw.columns.get_loc('NAME_FAMILY_STATUS_Unknown')
# Insert each column into the testing dataframe at the same index it had
# in the X_train_raw dataframe before PCA was run. Fill each column with all 0s.
# Order is important. 'CODE_GENDER_XNA' should be inserted first, followed by
# 'NAME_INCOME_TYPE_Maternity leave', and then finally 'NAME_FAMILY_STATUS_Unknown'.
testing_dataframe.insert(loc=loc_code_gender_training_frame, column='CODE_GENDER_XNA', value=0)
testing_dataframe.insert(loc=loc_name_income_type_maternity_leave_training_frame, column='NAME_INCOME_TYPE_Maternity leave', value=0)
testing_dataframe.insert(loc=loc_name_family_status_unknown_training_frame, column='NAME_FAMILY_STATUS_Unknown', value=0)
return testing_dataframe
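pandas can express the same column alignment more compactly: `DataFrame.reindex` over the training columns drops extras, inserts missing columns at the right positions, and fills them with a constant. A sketch of the equivalent idiom, using small hypothetical frames for illustration:

```python
import pandas as pd

train_cols = ['A', 'B', 'C']
test_df = pd.DataFrame({'A': [1], 'C': [3], 'EXTRA': [9]})
# Drop columns absent from the training frame, add missing ones filled
# with 0, and put everything in the training column order.
aligned = test_df.reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())  # ['A', 'B', 'C']
print(aligned.iloc[0].tolist())  # [1, 0, 3]
```

This guarantees the test matrix has exactly the training frame's columns in the training frame's order, which is the invariant the function above establishes by hand.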
def test_set_preprocessing_pipeline(testing_dataframe):
"""
Recreate all features that were engineered in the training set during
the original data preprocessing phase. The pipeline will also apply
an imputer to the test data table to fill 'NaN' values. Binary features'
'NaN' values will be filled with 0. The min-max scaler fit on features
in the training set will be applied to the numerical features in the testing set.
Parameters:
testing_dataframe: Pandas dataframe containing a testing dataset
Returns: a fully preprocessed testing dataframe
"""
# Create the HAS_CHILDREN feature.
CNT_CHILDREN_test = testing_dataframe['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_test.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# Drop the CNT_CHILDREN column from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_CHILDREN', axis=1)
# Create the NUMBER_FAMILY_MEMBERS feature.
CNT_FAM_MEMBERS_test = testing_dataframe['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_test.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# Drop the CNT_FAM_MEMBERS feature from the main dataframe
testing_dataframe = testing_dataframe.drop('CNT_FAM_MEMBERS', axis=1)
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
testing_dataframe = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(testing_dataframe)
# Create the HAS_JOB feature
DAYS_EMPLOYED_test = testing_dataframe['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_test.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
testing_dataframe = testing_dataframe.assign(HAS_JOB=HAS_JOB.values)
# Drop the DAYS_EMPLOYED feature from the main dataframe
testing_dataframe = testing_dataframe.drop('DAYS_EMPLOYED', axis=1)
# Translate the two negatively-valued features DAYS_REGISTRATION, and
# DAYS_LAST_PHONE_CHANGE to positive values
translate_negative_valued_features(testing_dataframe, non_norm_feat_neg_values_skewed)
# Log-transform all 17 non-normalized numerical features that have skewed distributions.
testing_dataframe[log_transform_feats] = testing_dataframe[log_transform_feats].apply(lambda x: np.log(x + 1))
# Use the imputer fit on the training set to replace 'NaN' values in the
# numerical features with each feature's training-set mean. Using transform
# rather than fit_transform keeps the test set's statistics out of the fill
# values. (This assumes the training set's 'NaN'-containing columns cover
# those of the testing set.)
testing_dataframe[numerical_features_with_nan] = imputer.transform(testing_dataframe[numerical_features_with_nan])
# Remove the borrower ID column, SK_ID_CURR, from the main dataframe
testing_dataframe = testing_dataframe.drop('SK_ID_CURR', axis=1)
# One-hot encode all 19 non-binary categorical features.
testing_dataframe = pd.get_dummies(testing_dataframe, columns=cat_feat_need_one_hot)
# After one-hot encoding, the testing dataframe from application_test.csv will be
# missing 2 columns that are in the training dataframe. It will also have an extra
# column that was not in the training dataframe, giving it 249 total columns.
# If this is the case, we need to modify this testing dataframe so that its columns
# and column order is consistent with the training dataframe.
if testing_dataframe.shape[1] == 249:
testing_dataframe = adjust_columns_application_test_csv_table(testing_dataframe)
# Create a list of the binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan_testing = testing_dataframe[all_bin_cat_feat].columns[testing_dataframe[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
testing_dataframe[bin_cat_feat_with_nan_testing] = testing_dataframe[bin_cat_feat_with_nan_testing].fillna(value=0)
# Transform each of the 21 features that need to be scaled to the range [0,1] using
# the min-max scaler fit on the training set.
testing_dataframe[feats_to_scale] = scaler.transform(testing_dataframe[feats_to_scale])
return testing_dataframe
# 19. Preprocess the test validation set.
X_test_final = test_set_preprocessing_pipeline(X_test_raw)
# Verify that both the training and test validation dataframes have the
# expected number of columns after preprocessing.
print('Training set preprocessing complete. The final training dataframe now has {} columns. Expected: 251.'.format(X_train_final.shape[1]))
print('Test validation set preprocessing complete. The final test validation dataframe now has {} columns. Expected: 251.'.format(X_test_final.shape[1]))
# Perform GridSearchCV on a LightGBM classifier to gain further
# intuition to aid in hyperparameter tuning.
# # Create cross-validation sets from the training data
# cv_sets = StratifiedKFold(n_splits = 5, random_state = 42)
# Transform 'roc_auc_scorer' into a scoring function using 'make_scorer'
scoring_fnc = make_scorer(roc_auc_scorer)
# Create an LightGBM classifier object
clf = lgb.LGBMClassifier(learning_rate = 0.1,
boosting_type = 'gbdt',
objective = 'binary',
metric = 'auc',
sub_feature = 0.3,
num_leaves = 50,
min_data_in_leaf = 500,
max_depth = -1,
max_bin = 100,
lambda_l2 = 0.1,
bagging_freq = 3,
bagging_fraction = 0.9,
random_state = 42,
)
# The parameters to search
grid_params = {
'learning_rate': [0.001, 0.01, 0.1],
'sub_feature': [0.3],
'num_leaves': [200],
'lambda_l2': [0.1],
'min_data_in_leaf': [40],
'max_depth': [-1]
}
# Create a GridSearchCV object.
grid = GridSearchCV(clf, grid_params, scoring_fnc, cv=3)
# Fit the grid search object to the data to compute the optimal model
grid.fit(X_train_final, y_train)
# Print the best parameters found
print('Best hyperparameter combo:')
print(grid.best_params_)
print('ROC AUC score of best hyperparameter combo:')
print(roc_auc_score(y_test, grid.predict_proba(X_test_final)[:,1]))
print(grid.predict_proba(X_test_final)[:,1])
# Use the LightGBM classifier with parameters discovered in GridSearchCV to
# make predictions on the test validation set. Calculate the area under ROC
# curve score of these predictions.
# After running GridSearchCV above, I observed that:
# 1.
# Convert dataframes to LGB format
lgb_training = lgb.Dataset(X_train_final, y_train)
# Final parameters for LightGBM training
params = {}
params['learning_rate'] = 0.001
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'auc'
params['num_leaves'] = 200
params['max_depth'] = 20
params['max_bin'] = 110
params['lambda_l2'] = 0.1
params['bagging_freq'] = 1
params['bagging_fraction'] = 0.95
params['bagging_seed'] = 1
params['feature_fraction'] = 0.9
params['feature_fraction_seed'] = 1
params['random_state'] = 42
# Fit the LightGBM classifier to the training data
clf_lgb = lgb.train(params, lgb_training, 15000)
# Classifier's estimates of probability of the positive class (TARGET=1): the
# probability estimate of each borrower making at least one late loan payment.
lgb_tuned_y_score = clf_lgb.predict(X_test_final)
# The area under the ROC curve between the true target values and the
# probability estimates of the predicted values.
lgb_tuned_roc_auc_score = roc_auc_scorer(y_test, lgb_tuned_y_score)
# Add the LightGBM classifier's scores to the results list.
#y_score_list.append(lgb_tuned_y_score)
#clf_label_list.append('LightGBM All Features, Further Tuning')
print('LightGBM (All Features, Further Tuning) test validation set predictions\' ROC AUC score: {}'.format(lgb_tuned_roc_auc_score))
# Plot the 50 largest LightGBM feature importances:
plt.figure(figsize = (34,26), dpi=300)
feat_importances = pd.Series(clf_lgb.feature_importance(importance_type='split', iteration=-1), clf_lgb.feature_name())
feat_importances = feat_importances.nlargest(50).sort_values(axis='index', ascending = True)
feat_importances.plot(kind='barh')
plt.title('LightGBM Top 50 Feature Importances', fontsize=24)
plt.xlabel('Feature Importance Value', fontsize=22)
plt.ylabel('Feature Name', fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.savefig('lightGBMFeatureImportances.png')
plt.show()
Unless specifically noted otherwise, each classifier was trained on the 184 binary features and 17 PCA-reduced numerical features that comprised all 201 features from the original preprocessed training set.
Certain LightGBM classifiers were trained on various different feature subsets/supersets of this original training set. These classifiers are indicated in the table and plot that follow.
ROC AUC scores indicate performance of classifier probability predictions made for the labels of the test validation set, which is 20% of the size (rows) of the training set contained in the application_train.csv data table:
| Classifier Name | ROC AUC Score |
|---|---|
| Naive Bayes (All Features) | 0.546645662333944 |
| AdaBoost (All Features) | 0.7462758964509755 |
| Logistic Regression (All Features) | 0.7471756350178691 |
| Multi-Layer Perceptron (All Features) | 0.7429017839300756 |
| LightGBM (All Features) | 0.7592132612569703 |
| Naive Bayes (PCA) | 0.5452255614331999 |
| AdaBoost (PCA) | 0.7415669749755673 |
| Logistic Regression (PCA) | 0.743963963781135 |
| Multi-Layer Perceptron (PCA) | 0.7439527449175637 |
| LightGBM (PCA) | 0.7483887050110797 |
| Naive Bayes (SelectKBest Features, K=30) | 0.6748662184461512 |
| AdaBoost (SelectKBest Features, K=30) | 0.7330739254581403 |
| Logistic Regression (SelectKBest Features, K=30) | 0.7367180213600446 |
| Multi-Layer Perceptron (SelectKBest Features, K=30) | 0.7358901049573279 |
| LightGBM (SelectKBest Features, K=30) | 0.7394787228642934 |
| LightGBM (All Features, Further Tuning) | 0.7609160310721934 |
# Display ROC curves of the Naive Bayes, AdaBoost, Logistic Regression, and
# Multi-Layer Perceptron classifiers' probability predictions.
vs.plot_roc_curves(y_test, y_score_list, clf_label_list, title='Receiver Operating Characteristic Curves');
# Train the final LightGBM classifier on the entire training set
# Load the main data tables
application_train_data = pd.read_csv("data/application_train.csv")
application_test_data = pd.read_csv("data/application_test.csv")
# Load the Bureau data table
bureau_data = pd.read_csv("data/bureau.csv")
# 1. Create lists of different feature types in the main data
# frame, based on how each type will need to be preprocessed.
# i. All 18 categorical features needing one-hot encoding.
# Includes the 4 categorical features originally
# mis-identified as having been normalized:
# EMERGENCYSTATE_MODE, HOUSETYPE_MODE, WALLSMATERIAL_MODE,
# FONDKAPREMONT_MODE
cat_feat_need_one_hot = [
'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
'NAME_TYPE_SUITE', 'OCCUPATION_TYPE', 'EMERGENCYSTATE_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'FONDKAPREMONT_MODE'
]
# ii. All 32 binary categorical features already one-hot encoded.
bin_cat_feat = [
'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'
]
# iii. All 2 non-normalized numerical features with skewed distributions
# and negative values. These features will need to have their
# distributions translated to positive ranges before being
# log-transformed, and then later scaled to the range [0,1].
non_norm_feat_neg_values_skewed = [
'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE'
]
# iv. All 15 non-normalized numerical features with skewed distributions,
# and only positive values. These features will need to be
# log-transformed, and eventually scaled to the range [0,1].
non_norm_feat_pos_values_skewed = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
'AMT_GOODS_PRICE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'OWN_CAR_AGE'
]
# v. All 4 numerical features with normal shapes but needing to be scaled
# to the range [0,1].
norm_feat_need_scaling = [
'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START',
'REGION_POPULATION_RELATIVE'
]
# vi. All 46 numerical features that have been normalized to the range
# [0,1]. These features will need neither log-transformation, nor
# any further scaling.
norm_feat_not_need_scaling = [
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_AVG',
'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'LIVINGAREA_AVG',
'LIVINGAREA_MODE', 'LIVINGAREA_MEDI', 'ENTRANCES_AVG',
'ENTRANCES_MODE', 'ENTRANCES_MEDI', 'APARTMENTS_AVG',
'APARTMENTS_MODE', 'APARTMENTS_MEDI', 'ELEVATORS_AVG',
'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'NONLIVINGAREA_AVG',
'NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI', 'EXT_SOURCE_1',
'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI',
'LANDAREA_AVG', 'LANDAREA_MODE', 'LANDAREA_MEDI',
'YEARS_BUILD_AVG', 'YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI',
'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI',
'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_MEDI',
'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MEDI',
'COMMONAREA_AVG', 'COMMONAREA_MODE', 'COMMONAREA_MEDI',
'TOTALAREA_MODE'
]
# vii. The remaining 3 features in the main data frame that will be
# re-engineered and transformed into different features
feat_to_be_reengineered = [
'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'DAYS_EMPLOYED'
]
# 2. Separate target data from training dataset.
targets = application_train_data['TARGET']
features_raw = application_train_data.drop('TARGET', axis = 1)
# 3. Because the entire training set from the file application_train.csv
# is being used for training, there is no need at this point to do a
# train test validation split.
y_train = targets
X_train_raw = features_raw
# 4. Use the CNT_CHILDREN feature to engineer a binary
# categorical feature called HAS_CHILDREN. If value of CNT_CHILDREN is
# greater than 0, the value of HAS_CHILDREN will be 1. If value of CNT_CHILDREN is
# 0, value of HAS_CHILDREN will be 0.
CNT_CHILDREN_train = X_train_raw['CNT_CHILDREN']
HAS_CHILDREN = CNT_CHILDREN_train.map(lambda x: 1 if x > 0 else 0)
# Append the newly engineered HAS_CHILDREN feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_CHILDREN=HAS_CHILDREN.values)
# 5. Drop the CNT_CHILDREN column from the main dataframe
X_train_raw = X_train_raw.drop('CNT_CHILDREN', axis=1)
# Add the new HAS_CHILDREN feature to the list of binary categorical
# features that are already one-hot encoded. There are now 33 such features.
bin_cat_feat = bin_cat_feat + ['HAS_CHILDREN']
# 6. Use the CNT_FAM_MEMBERS feature to engineer a categorical feature called NUMBER_FAMILY_MEMBERS.
# If CNT_FAM_MEMBERS is 1.0, then the value of NUMBER_FAMILY_MEMBERS will be 'one'. If CNT_FAM_MEMBERS is 2.0,
# then NUMBER_FAMILY_MEMBERS will be 'two'. If CNT_FAM_MEMBERS is 3.0 or greater, then NUMBER_FAMILY_MEMBERS will
# be 'three_plus'.
CNT_FAM_MEMBERS_train = X_train_raw['CNT_FAM_MEMBERS']
NUMBER_FAMILY_MEMBERS = CNT_FAM_MEMBERS_train.map(lambda x: 'one' if x == 1 else ('two' if x == 2 else 'three_plus'))
# Append the newly engineered NUMBER_FAMILY_MEMBERS feature to the main dataframe.
X_train_raw = X_train_raw.assign(NUMBER_FAMILY_MEMBERS=NUMBER_FAMILY_MEMBERS.values)
# 7. Drop the CNT_FAM_MEMBERS feature from the main dataframe
X_train_raw = X_train_raw.drop('CNT_FAM_MEMBERS', axis=1)
# Add the new NUMBER_FAMILY_MEMBERS feature to the list of categorical
# features that will need to be one-hot encoded. There are now 19 of these features.
cat_feat_need_one_hot = cat_feat_need_one_hot + ['NUMBER_FAMILY_MEMBERS']
# 8. Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
# categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
# particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
# HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
# borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
# Filter the bureau data table for loans which are overdue (have a value
# for CREDIT_DAY_OVERDUE that's greater than 0)
bureau_data_filtered_for_overdue = bureau_data[bureau_data['CREDIT_DAY_OVERDUE'] > 0]
def build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(dataframe):
"""
Use the CREDIT_DAY_OVERDUE feature in bureau.csv to engineer the binary
categorical HAS_CREDIT_BUREAU_LOANS_OVERDUE feature. If CREDIT_DAY_OVERDUE for a
particular borrower ID (SK_ID_CURR) is greater than 0, then the value of
HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 1. If CREDIT_DAY_OVERDUE for a particular
borrower ID is 0, then the value of HAS_CREDIT_BUREAU_LOANS_OVERDUE will be 0.
Parameters:
dataframe: Pandas dataframe containing a training or testing dataset
Returns: The dataframe with HAS_CREDIT_BUREAU_LOANS_OVERDUE feature appended to it.
"""
# Create a series called HAS_CREDIT_BUREAU_LOANS_OVERDUE and fill it with zeros.
# Its index is identical to that of the main dataframe. It will eventually be appended
# to the main data frame as a column.
HAS_CREDIT_BUREAU_LOANS_OVERDUE = pd.Series(data=0, index = dataframe['SK_ID_CURR'].index)
# A list of all the borrower IDs in the main dataframe
main_data_table_borrower_IDs = dataframe['SK_ID_CURR'].values
# For each loan in the bureau data table that is overdue
# (has a value for CREDIT_DAY_OVERDUE that's greater than 0)
for index, row in bureau_data_filtered_for_overdue.iterrows():
# The borrower ID (SK_ID_CURR) that owns the overdue loan
borrower_ID = row['SK_ID_CURR']
# If the borrower ID owning the overdue loan is also
# in the main data frame, then enter a value of 1 in
# the series HAS_CREDIT_BUREAU_LOANS_OVERDUE at an index
# that is identical to the index of the borrower ID
# in the main data frame.
if borrower_ID in main_data_table_borrower_IDs:
# The index of the borrower's row in the main data table.
borrower_index_main_data_table = dataframe.index[dataframe['SK_ID_CURR'] == borrower_ID].tolist()[0]
# Place a value of 1 at the index of the series HAS_CREDIT_BUREAU_LOANS_OVERDUE
# which corresponds to the index of the borrower's ID in the main data table.
HAS_CREDIT_BUREAU_LOANS_OVERDUE.loc[borrower_index_main_data_table] = 1
# Append the newly engineered HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the main dataframe.
dataframe = dataframe.assign(HAS_CREDIT_BUREAU_LOANS_OVERDUE=HAS_CREDIT_BUREAU_LOANS_OVERDUE.values)
return dataframe
# Build the HAS_CREDIT_BUREAU_LOANS_OVERDUE feature
X_train_raw = build_feature_HAS_CREDIT_BUREAU_LOANS_OVERDUE(X_train_raw)
# Add the new HAS_CREDIT_BUREAU_LOANS_OVERDUE feature to the list of binary categorical
# features. There are now 34 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_CREDIT_BUREAU_LOANS_OVERDUE']
# 9. Use the DAYS_EMPLOYED feature to engineer a binary categorical feature called HAS_JOB.
# If the value of DAYS_EMPLOYED is 0 or less, then HAS_JOB will be 1. Otherwise, HAS_JOB will
# be 0. The latter case covers every borrower with the sentinel value 365243 for DAYS_EMPLOYED,
# which I hypothesized is best interpreted as meaning that the borrower does not have a job.
DAYS_EMPLOYED_train = X_train_raw['DAYS_EMPLOYED']
HAS_JOB = DAYS_EMPLOYED_train.map(lambda x: 1 if x <= 0 else 0)
# Append the newly engineered HAS_JOB feature to the main dataframe.
X_train_raw = X_train_raw.assign(HAS_JOB=HAS_JOB.values)
# 10. Drop the DAYS_EMPLOYED feature from the main dataframe
X_train_raw = X_train_raw.drop('DAYS_EMPLOYED', axis=1)
# Add the new HAS_JOB feature to the list of binary categorical features.
# There are now 35 of these features.
bin_cat_feat = bin_cat_feat + ['HAS_JOB']
# 11. Translate the 2 non-normalized numerical features that have skewed distributions
# and negative values: DAYS_REGISTRATION and DAYS_LAST_PHONE_CHANGE
def translate_negative_valued_features(dataframe, feature_name_list):
"""
Translate a dataset's continuous features containing several negative
values. The dataframe is modified such that all values of each feature
listed in the feature_name_list parameter become positive.
Parameters:
dataframe: Pandas dataframe containing the features
feature_name_list: List of strings, containing the names
of each feature whose values will be
translated
"""
for feature in feature_name_list:
# The minimum (most negative) value of the feature
feature_min_value = dataframe[feature].min()
# Shift every value of the feature in the positive direction by a
# magnitude equal to the feature's most negative value.
dataframe[feature] = dataframe[feature].apply(lambda x: x - feature_min_value)
# Translate the above two negatively-valued features to positive values
translate_negative_valued_features(X_train_raw, non_norm_feat_neg_values_skewed)
# 12. Log-transform all 17 non-normalized numerical features that have skewed distributions.
# These 17 features include the 2 that were translated to positive ranges in Step 11.
# Add the 2 features translated to positive ranges above in Step 11 to
# the list of non-normalized skewed features with positive values. This is
# the set of features that will be log-transformed
log_transform_feats = non_norm_feat_pos_values_skewed + non_norm_feat_neg_values_skewed
X_train_raw[log_transform_feats] = X_train_raw[log_transform_feats].apply(lambda x: np.log(x + 1))
# 13. Replace 'NaN' values for all numerical features with each feature's mean. Fit an imputer
# to each numerical feature containing at least one 'NaN' entry.
# Create a list of all the 67 numerical features in the main dataframe. These include all
# 17 features that were log-transformed in Step 12, as well as the 4 normal features that
# still need to be scaled, as well as the 46 normal features that don't need scaling.
numerical_features = log_transform_feats + norm_feat_need_scaling + norm_feat_not_need_scaling
# Create a list of all numerical features in the training set that have at least one 'NaN' entry
numerical_features_with_nan = X_train_raw[numerical_features].columns[X_train_raw[numerical_features].isna().any()].tolist()
# Create a mean-value imputer. (Note: sklearn's Imputer defaults to
# strategy='mean'; in scikit-learn >= 0.22 it is replaced by SimpleImputer.)
imputer = Imputer(strategy='mean')
# Fit the imputer to each numerical feature in the training set that has 'NaN' values,
# and replace each 'NaN' entry of each feature with that feature's mean.
X_train_raw[numerical_features_with_nan] = imputer.fit_transform(X_train_raw[numerical_features_with_nan])
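The `Imputer` class used here comes from older scikit-learn releases; it was removed in 0.22 in favor of `SimpleImputer`. A minimal sketch of the same mean-imputation step with the current API (toy values, not the real columns):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two toy columns with missing entries (hypothetical values)
demo_X = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 8.0]])
# Each NaN is replaced by its column's mean: 2.0 for the first
# column, 6.0 for the second.
demo_imputer = SimpleImputer(strategy='mean')
demo_imputed = demo_imputer.fit_transform(demo_X)
```

Because the imputer is fit with `fit_transform` here, the same fitted object can later apply the *training-set* means to the test set via `transform`, avoiding leakage of test-set statistics.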
# 14. Remove the borrower ID column, SK_ID_CURR, from the main dataframe
X_train_raw = X_train_raw.drop('SK_ID_CURR', axis=1)
# 15. One-hot encode all 19 non-binary categorical features.
X_train_raw = pd.get_dummies(X_train_raw, columns=cat_feat_need_one_hot)
# Create a list that includes only the newly one-hot encoded features
# as well as all the categorical features that were already binary.
all_bin_cat_feat = X_train_raw.columns.tolist()
for column_name in X_train_raw[numerical_features].columns.tolist():
all_bin_cat_feat.remove(column_name)
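A toy illustration of what `get_dummies` produces here (hypothetical values; NAME_CONTRACT_TYPE is a real column name in application_train.csv):

```python
import pandas as pd

# One categorical and one numerical toy column (hypothetical values)
demo = pd.DataFrame({'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', 'Cash loans'],
                     'AMT_CREDIT': [100000.0, 50000.0, 250000.0]})
demo_encoded = pd.get_dummies(demo, columns=['NAME_CONTRACT_TYPE'])
# Numerical columns pass through untouched; each category value
# becomes its own 0/1 indicator column, and the original
# categorical column is dropped.
```

This is why subtracting the numerical column names from the full column list, as done above, leaves exactly the binary and one-hot encoded features.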
# 16. Replace all 'NaN' values in all binary categorical features with 0.
# Create a list of binary categorical features with at least one 'NaN' entry
bin_cat_feat_with_nan = X_train_raw[all_bin_cat_feat].columns[X_train_raw[all_bin_cat_feat].isna().any()].tolist()
# Replace each 'NaN' value in each of these binary features with 0
X_train_raw[bin_cat_feat_with_nan] = X_train_raw[bin_cat_feat_with_nan].fillna(value=0)
# 17. Fit a min-max scaler to each of the 17 log-transformed numerical features, as well
# as to the 4 normally distributed features that still need scaling: DAYS_BIRTH,
# DAYS_ID_PUBLISH, HOUR_APPR_PROCESS_START, and REGION_POPULATION_RELATIVE. Each
# feature will be scaled to the range [0.0, 1.0].
# Build a list of all 21 features needing scaling: the normally shaped features
# that need scaling plus the features that were log-transformed above in Step 12.
feats_to_scale = norm_feat_need_scaling + log_transform_feats
# Initialize a scaler with the default range of [0,1]
scaler = MinMaxScaler()
# Fit the scaler to each of the features of the train set that need to be scaled,
# then transform each of these features' values to the new scale.
X_train_raw[feats_to_scale] = scaler.fit_transform(X_train_raw[feats_to_scale])
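On a single toy column (hypothetical values), the min-max mapping looks like this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy feature column (hypothetical values)
demo_X = np.array([[10.0], [15.0], [20.0]])
demo_scaler = MinMaxScaler()  # default feature_range is (0, 1)
demo_scaled = demo_scaler.fit_transform(demo_X)
# The column minimum maps to 0.0, the maximum to 1.0,
# and the midpoint to 0.5.
```

As with the imputer, fitting on the training set means the same fitted scaler can later `transform` the test set using the training set's minima and maxima.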
# Rename the dataframe to indicate that its columns have been fully preprocessed.
X_train_final = X_train_raw
print('Entire training dataset preprocessing complete.')
print('Number of columns: {}. Expected: 251.'.format(X_train_final.shape[1]))
print('Number of rows: {}. Expected: 307511.'.format(X_train_final.shape[0]))
print('Number of labels: {}. Expected: 307511.'.format(y_train.shape[0]))
# Fit a LightGBM classifier to the entire training set using the parameters
# that were tuned in the final refinement step above.
# Convert the entire training dataframe to LGB format
lgb_training = lgb.Dataset(X_train_final, y_train)
# Final parameters for LightGBM training
params = {}
params['learning_rate'] = 0.001
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'auc'
params['num_leaves'] = 200
params['max_depth'] = 20
params['max_bin'] = 110
params['lambda_l2'] = 0.1
params['bagging_freq'] = 1
params['bagging_fraction'] = 0.95
params['bagging_seed'] = 1
params['feature_fraction'] = 0.9
params['feature_fraction_seed'] = 1
params['random_state'] = 42
# Fit the LightGBM classifier to the training data
clf_lgb = lgb.train(params, lgb_training, 15000)
# Build a prediction pipeline for the testing data table (application_test.csv) that
# saves prediction probabilities to a CSV file, which will then be submitted on Kaggle.
def testing_data_table_predictions_to_csv(clf, testing_data_table, isLightGBM):
"""
A prediction pipeline that:
1. Preprocesses the 48,744 row testing data table
2. Uses a classifier to compute estimates of the probability of the positive
class (TARGET=1) for each borrower: the probability estimate of each borrower
making at least one late loan payment.
3. Saves a CSV file that contains probabilities of target labels for each
borrower (SK_ID_CURR) in the testing data table.
4. isLightGBM: Boolean, a flag that indicates whether or not the classifier is
LightGBM. If True,
Parameters:
clf: A machine learning classifier object that has already been fit to
the training data.
testing_data_table: Pandas dataframe containing the testing dataset.
"""
# Get a list of the borrower IDs (SK_ID_CURR column). The borrower ID must be
# placed in each row of CSV file that will be created.
borrower_IDs = testing_data_table['SK_ID_CURR']
# Preprocess the testing data table so that predictions can be made on it.
X_test_final = test_set_preprocessing_pipeline(testing_data_table)
#print('application_test.csv testing set processing complete. The processed dataframe now has {} columns. Expected: 251.'.format(X_test_final.shape[1]))
    # Classifier's estimates of the probability of the positive class (TARGET=1):
    # the probability of each borrower making at least one late loan payment.
    # For LightGBM, the prediction method is simply 'predict', and the array
    # of probabilities it returns has a slightly different shape than those
    # produced by the other classifiers.
if isLightGBM:
clf_y_score = clf.predict(X_test_final)
else:
clf_y_score = clf.predict_proba(X_test_final)[:, 1]
# Create the CSV file that will be saved
file_output = 'dellinger_kaggle_home_credit_submission5.csv'
# Write to the CSV file
    with open(file_output, 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
# Write the header row
writer.writerow(['SK_ID_CURR','TARGET'])
        # Write a row for each borrower containing the predicted
        # probability of that borrower's label. (Series.iteritems was
        # removed in pandas 2.0; items() is the supported spelling.)
        for index, value in borrower_IDs.items():
            writer.writerow([value, clf_y_score[index]])
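The row-by-row csv.writer loop above can also be expressed as a two-column dataframe written with `to_csv`; a sketch with toy IDs, toy probabilities, and a demo filename (all hypothetical):

```python
import pandas as pd

# Toy borrower IDs and predicted probabilities (hypothetical values)
demo_submission = pd.DataFrame({'SK_ID_CURR': [100001, 100005, 100013],
                                'TARGET': [0.1, 0.9, 0.2]})
# index=False keeps the file to exactly the two required columns
demo_submission.to_csv('demo_submission.csv', index=False)
```

Either approach produces the header-plus-rows layout Kaggle expects for this competition.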
# To submit to Kaggle: the LightGBM Classifier's predictions on full featureset.
# Create predictions on the data in the testing data table (application_test.csv)
# using the LightGBM classifier fit above. Also create a CSV
# file containing the prediction probabilities for each borrower ID (SK_ID_CURR)
# in the testing data table.
testing_data_table_predictions_to_csv(clf_lgb, application_test_data, True)